<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.2">Jekyll</generator><link href="https://blog.labxq.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://blog.labxq.com/" rel="alternate" type="text/html" /><updated>2024-04-01T11:59:06-07:00</updated><id>https://blog.labxq.com/feed.xml</id><title type="html">BlogXQ</title><entry><title type="html">GPU cloud: architecture and mechanism</title><link href="https://blog.labxq.com/os/virtualization/2020/12/30/gpu-cloud-mechanism.html" rel="alternate" type="text/html" title="GPU cloud: architecture and mechanism" /><published>2020-12-30T00:00:00-08:00</published><updated>2020-12-30T00:00:00-08:00</updated><id>https://blog.labxq.com/os/virtualization/2020/12/30/gpu-cloud-mechanism</id><content type="html" xml:base="https://blog.labxq.com/os/virtualization/2020/12/30/gpu-cloud-mechanism.html"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>As the world enters the era of artificial intelligence, demand for parallel computing power (which is largely provided by GPUs) is growing at an unprecedented rate.</p>

<p>However, powerful GPUs are expensive and cannot be installed in every device. The current industry trend is therefore to move GPU computing power from client machines to data centers and the cloud.</p>

<p>Major cloud providers such as AWS and GCP already offer GPU instances in their virtual machine rental services. NVIDIA also provides cloud services for GPU-specific workloads such as gaming and deep learning.</p>

<h2 id="gpu-cloud-architecture">GPU cloud architecture</h2>
<p>The core value of the cloud is <strong>resource sharing and renting</strong>. Hardware resources are assigned according to need, which improves utilization and reduces cost.</p>

<p>The virtual machine is the basic building block for dividing resources, and a GPU cloud is no different. Unlike CPU, memory and storage, however, a GPU takes extra effort to bring into a VM. <strong>The rest of the article explains in detail how GPU resources are virtualized.</strong></p>

<p>The architecture of a typical GPU cloud is shown below:
<img src="/assets/images/OS-gpu_cloud_architecture.svg" alt="gpu_cloud_architecture" /></p>

<p>The diagram depicts the structure of a single zone. The admin host manages the entire data center, and the GPU hosts provide GPU-powered VM instances to customers.</p>

<h2 id="linux-virtualization-stack">Linux virtualization stack</h2>
<p>Linux’s virtualization stack contains several layers:</p>

<p><img src="/assets/images/OS-linux_virtualization_stack.svg" alt="linux_virtualization_stack" /></p>
<h3 id="client">client</h3>
<p>This layer consists of the users of virtualization: command-line tools, GUI applications, and the management software of cloud service providers.</p>
<h3 id="libvirt">libvirt</h3>
<p>libvirt is an open-source API and toolkit, developed mainly by Red Hat, that unifies the management of different virtualization back ends.</p>
<h3 id="qemu">QEMU</h3>
<p>QEMU is a hardware emulator: it provides a set of emulated hardware and device models for the virtual machine.</p>

<p>KVM runs in kernel space and contains only the core virtualization logic, such as vCPUs and memory mapping. QEMU complements it by providing peripheral devices.</p>

<p>When the guest OS accesses an emulated device, the access traps to the host, where QEMU emulates the device’s behavior.</p>

<h2 id="kvm">KVM</h2>
<p>KVM is the virtualization module shipped with the Linux kernel. Since its release it has steadily gained popularity.</p>

<p>Large cloud service providers such as AWS are actively migrating from older virtualization solutions such as Xen to KVM.</p>
<h3 id="overview">Overview</h3>
<p>KVM itself is a Linux kernel module. The diagram below shows how it interacts with the guest and the host:
<img src="/assets/images/OS-kvm_overview.svg" alt="kvm_overview" /></p>

<h3 id="project-structure">Project Structure</h3>
<p>The KVM source code lives in the Linux mainline tree at <a href="https://elixir.bootlin.com/linux/latest/source/virt/kvm">/virt/kvm</a>.
Most of the architecture-independent logic is in the <a href="https://elixir.bootlin.com/linux/latest/source/virt/kvm">kvm_main.c</a> file.</p>

<h3 id="process-of-creating-a-vm">Process of creating a VM</h3>
<p>Creating a VM is one of the most common KVM workflows. Let’s walk through the process:</p>
<h4 id="1-open-kvm-device">1. Open KVM device</h4>
<p>KVM is exposed as a device file in Linux. Before it can manage any VM, the device has to be opened:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int kvm_fd = open("/dev/kvm", O_RDWR | O_CLOEXEC);  /* needs &lt;fcntl.h&gt; */
</code></pre></div></div>

<h4 id="2-create-vm">2. Create VM</h4>
<p>After the KVM device is opened, we use <code class="language-plaintext highlighter-rouge">ioctl</code> to send it a command that asks for a new VM:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);  /* returns a new fd that represents the VM */
</code></pre></div></div>
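<p>The request codes passed to <code class="language-plaintext highlighter-rouge">ioctl</code> are plain integers built from a “magic” byte and a command number. As a rough sketch (written in Python rather than C, and assuming the generic Linux ioctl encoding used on x86-64), the following rebuilds a few KVM request codes and queries the API version when <code class="language-plaintext highlighter-rouge">/dev/kvm</code> is accessible. The helper names <code class="language-plaintext highlighter-rouge">_io</code>, <code class="language-plaintext highlighter-rouge">_iow</code> and <code class="language-plaintext highlighter-rouge">kvm_api_version</code> are illustrative and not part of any library:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import fcntl
import os

KVMIO = 0xAE  # ioctl "magic" byte used by KVM in linux/kvm.h


def _io(magic, nr):
    """_IO(): request without an argument payload (generic encoding)."""
    return (magic &lt;&lt; 8) | nr


def _iow(magic, nr, size):
    """_IOW(): user space hands `size` bytes to the kernel."""
    return (1 &lt;&lt; 30) | (size &lt;&lt; 16) | (magic &lt;&lt; 8) | nr


KVM_GET_API_VERSION = _io(KVMIO, 0x00)
KVM_CREATE_VM = _io(KVMIO, 0x01)
KVM_CREATE_VCPU = _io(KVMIO, 0x41)
KVM_RUN = _io(KVMIO, 0x80)
# struct kvm_userspace_memory_region is 32 bytes long.
KVM_SET_USER_MEMORY_REGION = _iow(KVMIO, 0x46, 32)


def kvm_api_version(path="/dev/kvm"):
    """Return the KVM API version, or None if the device cannot be used."""
    try:
        fd = os.open(path, os.O_RDWR | os.O_CLOEXEC)
    except OSError:
        return None
    try:
        return fcntl.ioctl(fd, KVM_GET_API_VERSION, 0)
    except OSError:
        return None
    finally:
        os.close(fd)


print(hex(KVM_CREATE_VM), kvm_api_version())
</code></pre></div></div>
<p>On a host without access to <code class="language-plaintext highlighter-rouge">/dev/kvm</code> the helper simply returns <code class="language-plaintext highlighter-rouge">None</code>, so the sketch is safe to run anywhere on Linux.</p>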
<h4 id="3-create-virtual-cpu-for-the-vm">3. Create virtual CPU for the VM</h4>
<p>The core components of a VM are its CPUs and memory. A vCPU is also represented by a file descriptor, which is created with <code class="language-plaintext highlighter-rouge">ioctl</code> on the VM descriptor:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);  /* last argument is the vCPU id */
</code></pre></div></div>
<h4 id="4-set-up-memory-for-the-vm">4. Set up memory for the VM</h4>
<p>The VM’s memory is allocated in user space and then mapped into the guest’s physical address space:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void *mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
/* slot 0, placed at guest-physical address 0; userspace_addr is a __u64, so the pointer is cast */
struct kvm_userspace_memory_region region = {.slot = 0, .guest_phys_addr = 0,
                                             .memory_size = mem_size, .userspace_addr = (uint64_t)mem};
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &amp;region);
</code></pre></div></div>
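<p>To make the layout of the <code class="language-plaintext highlighter-rouge">ioctl</code> argument concrete, here is a small Python sketch that allocates anonymous shared memory and serializes a <code class="language-plaintext highlighter-rouge">struct kvm_userspace_memory_region</code> for it. The function name <code class="language-plaintext highlighter-rouge">pack_memory_region</code> and the 2 MiB guest size are assumptions made for the example; the field order follows <code class="language-plaintext highlighter-rouge">linux/kvm.h</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import ctypes
import mmap
import struct

# struct kvm_userspace_memory_region:
#   __u32 slot; __u32 flags; __u64 guest_phys_addr;
#   __u64 memory_size; __u64 userspace_addr;
REGION_FORMAT = "=IIQQQ"


def pack_memory_region(slot, guest_phys_addr, memory_size, userspace_addr, flags=0):
    """Serialize the argument of KVM_SET_USER_MEMORY_REGION."""
    for value in (guest_phys_addr, memory_size, userspace_addr):
        if value % mmap.PAGESIZE:
            raise ValueError("addresses and size must be page aligned")
    return struct.pack(REGION_FORMAT, slot, flags, guest_phys_addr,
                       memory_size, userspace_addr)


# Anonymous shared mapping that would back a hypothetical 2 MiB guest.
mem_size = 2 * 1024 * 1024
backing = mmap.mmap(-1, mem_size)
# Address of the mapping in this process, as the kernel expects it.
host_addr = ctypes.addressof(ctypes.c_char.from_buffer(backing))

region = pack_memory_region(0, 0, mem_size, host_addr)
print(len(region), "bytes")
</code></pre></div></div>
<p>The packed bytes are what a Python VMM would hand to <code class="language-plaintext highlighter-rouge">fcntl.ioctl</code> together with the VM descriptor. The alignment check mirrors the requirement that the region size and addresses are page aligned.</p>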
<h4 id="5-run-vm">5. Run VM</h4>
<p>Once everything is set up, we run the vCPU and wait for the guest to exit back to user space. Each exit is handled, then the guest is resumed:</p>
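<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>while (true) {
    /* Blocks until the guest exits to user space; the reason is reported in the mmap'ed struct kvm_run. */
    ioctl(vcpu_fd, KVM_RUN, 0);
}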
</code></pre></div></div>
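<p>In practice the loop has to look at why the guest exited before re-entering it. The sketch below models that dispatch in Python. It is only a simulation: <code class="language-plaintext highlighter-rouge">run_once</code> is a hypothetical stand-in for <code class="language-plaintext highlighter-rouge">ioctl(vcpu_fd, KVM_RUN, 0)</code> followed by reading <code class="language-plaintext highlighter-rouge">exit_reason</code>, while the numeric exit reasons are taken from <code class="language-plaintext highlighter-rouge">linux/kvm.h</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A subset of the exit reasons defined in linux/kvm.h.
KVM_EXIT_IO = 2
KVM_EXIT_HLT = 5
KVM_EXIT_MMIO = 6
KVM_EXIT_SHUTDOWN = 8


def run_vcpu(run_once, handlers):
    """Re-enter the guest until it halts or shuts down.

    `run_once` stands in for KVM_RUN and returns the exit reason;
    `handlers` maps an exit reason to a callable that emulates the
    operation the guest asked for. Returns the number of exits seen.
    """
    exits = 0
    while True:
        reason = run_once()
        exits += 1
        if reason in (KVM_EXIT_HLT, KVM_EXIT_SHUTDOWN):
            return exits
        handler = handlers.get(reason)
        if handler is None:
            raise RuntimeError(f"unhandled KVM exit reason {reason}")
        handler()


# Simulated guest: two port-I/O exits followed by HLT.
trace = iter([KVM_EXIT_IO, KVM_EXIT_IO, KVM_EXIT_HLT])
io_log = []
total = run_vcpu(lambda: next(trace), {KVM_EXIT_IO: lambda: io_log.append("io")})
print(total, io_log)
</code></pre></div></div>
<p>A real VMM such as QEMU follows the same pattern: port I/O and MMIO exits are routed to device models, and the vCPU is resumed afterwards.</p>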
<h3 id="tips">Tips</h3>
<p>The KVM API is mainly used by virtual machine monitors such as QEMU. Clients usually control the behavior of a VM through QEMU command-line parameters and seldom call the KVM API directly.</p>

<h2 id="gpu-resource-sharing">GPU resource sharing</h2>
<h3 id="introduction-1">Introduction</h3>
<p>The core value of the cloud is sharing resources. For CPU, memory and network this is relatively easy to implement: we can assign any fraction of the resource to a VM.</p>

<p>For GPUs, things are different. Most consumer-grade GPUs do not support dividing their resources, so a GPU can only be assigned to one VM, exclusively and as a whole. We could fill the PCIe slots with many GPUs and assign one GPU per VM, but that quickly becomes cumbersome to manage and maintain.</p>
<h3 id="vgpu">vGPU</h3>
<p>NVIDIA’s vGPU technology addresses this problem. It is supported on professional GPUs such as the Quadro and Tesla lines.</p>
<h4 id="mediated-device-frameworkmdev">Mediated device framework (mdev)</h4>
<p>This is the mechanism used to implement vGPU. A mediated device is compatible with the VFIO UAPI, but it is managed by the vendor driver, which controls the resource-sharing logic.</p>

<p>mdev can generate multiple virtual devices, each holding a fraction of the GPU’s resources. The generated virtual devices appear under <code class="language-plaintext highlighter-rouge">/sys/bus/mdev/devices/</code> and are ready to be attached to VMs.</p>
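<p>Mediated devices are created through sysfs by writing a UUID to the <code class="language-plaintext highlighter-rouge">create</code> file of one of the parent device’s supported types. A minimal Python sketch of that step follows. The function names and the <code class="language-plaintext highlighter-rouge">sysfs_root</code> parameter are illustrative, and which type names exist depends entirely on the vendor driver:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import uuid
from pathlib import Path


def create_mdev(parent_bdf, type_id, sysfs_root="/sys"):
    """Create a mediated device under `parent_bdf` and return its UUID.

    Writes a fresh UUID to
    bus/pci/devices/PARENT/mdev_supported_types/TYPE/create.
    Needs root privileges on a real system.
    """
    type_dir = (Path(sysfs_root) / "bus/pci/devices" / parent_bdf
                / "mdev_supported_types" / type_id)
    dev_uuid = str(uuid.uuid4())
    (type_dir / "create").write_text(dev_uuid)
    return dev_uuid


def list_mdevs(sysfs_root="/sys"):
    """Return the UUIDs of all existing mediated devices."""
    devices = Path(sysfs_root) / "bus/mdev/devices"
    if not devices.is_dir():
        return []
    return sorted(entry.name for entry in devices.iterdir())


print(list_mdevs())
</code></pre></div></div>
<p>On a vGPU-capable host, a call such as <code class="language-plaintext highlighter-rouge">create_mdev("0000:08:00.0", "some-type")</code> would be made with a type name taken from the <code class="language-plaintext highlighter-rouge">mdev_supported_types</code> directory; both arguments here are placeholders.</p>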

<h3 id="gpu-sharing-for-container">GPU sharing for container</h3>
<p>Containers are becoming ever more popular, so demand for sharing a GPU among containers arises naturally.</p>

<p>As of 2020, it remains an open question: <a href="https://github.com/kubernetes/kubernetes/issues/52757">Is sharing GPU to multiple containers feasible?</a></p>

<p>The root cause is that most non-datacenter GPUs offer no way to partition GPU resources (CUDA cores, memory) or even to assign priorities. Until NVIDIA provides hardware support for resource sharing on them, there is little we can do about it🙃</p>

<h2 id="gpu-passthrough-access-gpu-from-vm">GPU Passthrough: access GPU from VM</h2>
<p>A GPU appears to the system as a PCIe device and is normally claimed and driven by a host driver. For a VM to use it, the GPU must instead be bound on the host to a special driver, the <strong>vfio</strong> driver, which hands control of the device over to the VM.</p>
<h3 id="how-to-enable-gpu-passthrough">How to enable GPU passthrough</h3>
<h4 id="bind-vfio-driver-to-gpu">Bind vfio driver to GPU</h4>
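<p>Passing the GPU’s PCI vendor:device ID to <code class="language-plaintext highlighter-rouge">vfio-pci</code> through a GRUB kernel parameter is a convenient way to bind the vfio driver at boot:</p>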
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GRUB_CMDLINE_LINUX_DEFAULT="vfio-pci.ids=10de:2484"
</code></pre></div></div>
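<p>After a reboot it is worth checking that the binding took effect. The following Python sketch reads the vendor:device ID and the bound driver of a PCI device from sysfs. The function names and the <code class="language-plaintext highlighter-rouge">sysfs_root</code> parameter are assumptions made for this example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os
from pathlib import Path


def pci_id(bdf, sysfs_root="/sys"):
    """Return the vendor:device ID (e.g. '10de:2484') used in vfio-pci.ids."""
    dev = Path(sysfs_root) / "bus/pci/devices" / bdf
    vendor = int((dev / "vendor").read_text().strip(), 16)  # file holds e.g. 0x10de
    device = int((dev / "device").read_text().strip(), 16)  # file holds e.g. 0x2484
    return f"{vendor:04x}:{device:04x}"


def bound_driver(bdf, sysfs_root="/sys"):
    """Return the name of the driver bound to the device, or None if unbound."""
    link = Path(sysfs_root) / "bus/pci/devices" / bdf / "driver"
    try:
        return os.path.basename(os.readlink(link))
    except OSError:
        return None
</code></pre></div></div>
<p>For a GPU that is ready for passthrough, <code class="language-plaintext highlighter-rouge">bound_driver</code> is expected to report <code class="language-plaintext highlighter-rouge">vfio-pci</code> rather than <code class="language-plaintext highlighter-rouge">nvidia</code> or <code class="language-plaintext highlighter-rouge">nouveau</code>.</p>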
<h4 id="attach-gpu-to-vm">Attach GPU to VM</h4>
<p>This can be done with a command-line tool or a GUI application. The GPU is added to the VM as a PCIe device.</p>

<p>The client calls the libvirt library, and libvirt starts the QEMU process with a <code class="language-plaintext highlighter-rouge">-device</code> parameter that identifies the attached device. The resulting process looks like this:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ps <span class="nt">-ef</span>|grep qemu
libvirt+ 1351222       1 99 16:14 ?        00:02:21 /usr/bin/qemu-system-x86_64 ... <span class="nt">-device</span> vfio-pci,host<span class="o">=</span>01:00.0
</code></pre></div></div>
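<p>To illustrate how a PCI address ends up in both places, the sketch below derives a libvirt <code class="language-plaintext highlighter-rouge">hostdev</code> element and the matching QEMU <code class="language-plaintext highlighter-rouge">-device</code> argument from one address string. The helper functions are hypothetical and libvirt performs this translation internally; only the general shape of the two outputs is meant to be accurate:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re
import xml.etree.ElementTree as ET

# [domain:]bus:slot.function, all hexadecimal, e.g. 0000:01:00.0 or 01:00.0
BDF = re.compile(r"^(?:([0-9a-f]{4}):)?([0-9a-f]{2}):([0-9a-f]{2})\.([0-7])$", re.I)


def parse_bdf(text):
    """Split a PCI address into (domain, bus, slot, function) integers."""
    match = BDF.match(text)
    if match is None:
        raise ValueError(f"not a PCI address: {text!r}")
    domain, bus, slot, func = match.groups()
    return int(domain or "0", 16), int(bus, 16), int(slot, 16), int(func, 16)


def hostdev_xml(bdf):
    """Build a libvirt hostdev element for PCI passthrough of `bdf`."""
    domain, bus, slot, func = parse_bdf(bdf)
    hostdev = ET.Element("hostdev", mode="subsystem", type="pci", managed="yes")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(source, "address", domain=f"0x{domain:04x}", bus=f"0x{bus:02x}",
                  slot=f"0x{slot:02x}", function=f"0x{func:x}")
    return ET.tostring(hostdev, encoding="unicode")


def qemu_device_arg(bdf):
    """Build the value of QEMU's -device option for the same device."""
    domain, bus, slot, func = parse_bdf(bdf)
    return f"vfio-pci,host={domain:04x}:{bus:02x}:{slot:02x}.{func:x}"


print(hostdev_xml("01:00.0"))
print(qemu_device_arg("01:00.0"))
</code></pre></div></div>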

<h3 id="vfio-driver">vfio driver</h3>
<h4 id="background">Background</h4>
<p>In the past, specific device-assignment code had to be written for different PCI devices.</p>

<p>The VFIO driver framework was developed to unify the way a VM accesses devices.</p>

<h4 id="mechanism">Mechanism</h4>
<p>In essence, vfio is a re-mapper of device addresses. When the guest accesses the device, vfio is responsible for remapping the address to the real, physical device address. The host’s own device driver is thereby bypassed, and the guest accesses the physical PCIe device directly.</p>

<p>VFIO manages devices at the granularity of an <strong>IOMMU group</strong>. The IOMMU sits between a device’s I/O bus and main memory and translates the addresses the device uses.</p>

<p>An IOMMU group is the smallest set of devices that the IOMMU can isolate from the rest of the system. Its members are usually logically related and are assigned to a VM together. A typical example is the video and audio functions of an NVIDIA GPU.</p>

<p>You can use the <code class="language-plaintext highlighter-rouge">lspci</code> command to inspect the IOMMU group of a PCI device:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ lspci -vvv
08:00.0 VGA compatible controller: NVIDIA Corporation Device 2484 (rev a1) (prog-if 00 [VGA controller])
    Subsystem: NVIDIA Corporation Device 146b
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast &gt;TAbort- &lt;TAbort- &lt;MAbort- &gt;SERR- &lt;PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 79
    IOMMU group: 13
    Region 0: Memory at f5000000 (32-bit, non-prefetchable) [size=16M]
    Region 1: Memory at e0000000 (64-bit, prefetchable) [size=256M]
    Region 3: Memory at f0000000 (64-bit, prefetchable) [size=32M]
    Region 5: I/O ports at d000 [size=128]
    Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
    Capabilities: &lt;access denied&gt;
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

</code></pre></div></div>
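<p>The same information is available for all devices at once under <code class="language-plaintext highlighter-rouge">/sys/kernel/iommu_groups</code>. As a short Python sketch (the function name and the <code class="language-plaintext highlighter-rouge">sysfs_root</code> parameter are illustrative), the groups can be listed like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from pathlib import Path


def iommu_groups(sysfs_root="/sys"):
    """Map each IOMMU group number to the PCI addresses it contains."""
    root = Path(sysfs_root) / "kernel/iommu_groups"
    groups = {}
    if root.is_dir():
        for group in root.iterdir():
            devices = group / "devices"
            groups[int(group.name)] = sorted(d.name for d in devices.iterdir())
    return groups


for number, devices in sorted(iommu_groups().items()):
    print(number, " ".join(devices))
</code></pre></div></div>
<p>Every device printed on the same line belongs to one group, so all of them have to be handed to vfio before any one of them can be passed through. An empty result usually means the IOMMU is disabled in the firmware or on the kernel command line.</p>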

<h4 id="source-code-analysis">Source code analysis</h4>
<p>The PCI driver of VFIO, vfio-pci, is located at <code class="language-plaintext highlighter-rouge">/drivers/vfio/pci</code>.
Its main logic is in <code class="language-plaintext highlighter-rouge">vfio_pci.c</code>.</p>

<p>A VFIO device is treated just like a file that we can open, release, read and write. Its operations are defined as follows:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//source: https://github.com/torvalds/linux/blob/139711f033f636cc78b6aaf7363252241b9698ef/drivers/vfio/pci/vfio_pci.c#L1884</span>
<span class="k">static</span> <span class="k">const</span> <span class="k">struct</span> <span class="nc">vfio_device_ops</span> <span class="n">vfio_pci_ops</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">.</span><span class="n">name</span>		<span class="o">=</span> <span class="s">"vfio-pci"</span><span class="p">,</span>
    <span class="p">.</span><span class="n">open</span>		<span class="o">=</span> <span class="n">vfio_pci_open</span><span class="p">,</span>
    <span class="p">.</span><span class="n">release</span>	<span class="o">=</span> <span class="n">vfio_pci_release</span><span class="p">,</span>
    <span class="p">.</span><span class="n">ioctl</span>		<span class="o">=</span> <span class="n">vfio_pci_ioctl</span><span class="p">,</span>
    <span class="p">.</span><span class="n">read</span>		<span class="o">=</span> <span class="n">vfio_pci_read</span><span class="p">,</span>
    <span class="p">.</span><span class="n">write</span>		<span class="o">=</span> <span class="n">vfio_pci_write</span><span class="p">,</span>
    <span class="p">.</span><span class="n">mmap</span>		<span class="o">=</span> <span class="n">vfio_pci_mmap</span><span class="p">,</span>
    <span class="p">.</span><span class="n">request</span>	<span class="o">=</span> <span class="n">vfio_pci_request</span><span class="p">,</span>
    <span class="p">.</span><span class="n">match</span>		<span class="o">=</span> <span class="n">vfio_pci_match</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>

<p>VFIO read and write share the same back end, <code class="language-plaintext highlighter-rouge">vfio_pci_rw</code>. Its code looks like this:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//source: https://github.com/torvalds/linux/blob/139711f033f636cc78b6aaf7363252241b9698ef/drivers/vfio/pci/vfio_pci.c#L1407</span>
<span class="k">static</span> <span class="kt">ssize_t</span> <span class="nf">vfio_pci_rw</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">device_data</span><span class="p">,</span> <span class="kt">char</span> <span class="n">__user</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span>
               <span class="kt">size_t</span> <span class="n">count</span><span class="p">,</span> <span class="n">loff_t</span> <span class="o">*</span><span class="n">ppos</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">iswrite</span><span class="p">)</span> 
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">index</span> <span class="o">=</span> <span class="n">VFIO_PCI_OFFSET_TO_INDEX</span><span class="p">(</span><span class="o">*</span><span class="n">ppos</span><span class="p">);</span>
    <span class="k">struct</span> <span class="nc">vfio_pci_device</span> <span class="o">*</span><span class="n">vdev</span> <span class="o">=</span> <span class="n">device_data</span><span class="p">;</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">index</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="n">VFIO_PCI_CONFIG_REGION_INDEX</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">vfio_pci_config_rw</span><span class="p">(</span><span class="n">vdev</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">count</span><span class="p">,</span> <span class="n">ppos</span><span class="p">,</span> <span class="n">iswrite</span><span class="p">);</span>

    <span class="k">case</span> <span class="n">VFIO_PCI_BAR0_REGION_INDEX</span> <span class="p">...</span> <span class="n">VFIO_PCI_BAR5_REGION_INDEX</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">vfio_pci_bar_rw</span><span class="p">(</span><span class="n">vdev</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">count</span><span class="p">,</span> <span class="n">ppos</span><span class="p">,</span> <span class="n">iswrite</span><span class="p">);</span>
    <span class="nl">default:</span>
        <span class="n">index</span> <span class="o">-=</span> <span class="n">VFIO_PCI_NUM_REGIONS</span><span class="p">;</span>
        <span class="k">return</span> <span class="n">vdev</span><span class="o">-&gt;</span><span class="n">region</span><span class="p">[</span><span class="n">index</span><span class="p">].</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">rw</span><span class="p">(</span><span class="n">vdev</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span>
                           <span class="n">count</span><span class="p">,</span> <span class="n">ppos</span><span class="p">,</span> <span class="n">iswrite</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The region index, taken from the file offset, selects the handler: PCI configuration space, one of the six BARs, or a device-specific region.</p>
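<p>The index is carried in the upper bits of the file offset. The Python sketch below mimics the <code class="language-plaintext highlighter-rouge">VFIO_PCI_OFFSET_TO_INDEX</code> arithmetic, assuming the 40-bit shift used by vfio-pci and the region order of the enum in <code class="language-plaintext highlighter-rouge">linux/vfio.h</code>. The function names are made up for the example, and real user-space code should obtain region offsets from <code class="language-plaintext highlighter-rouge">VFIO_DEVICE_GET_REGION_INFO</code> instead of hard-coding the shift:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>VFIO_PCI_OFFSET_SHIFT = 40  # upper bits of the offset carry the region index
VFIO_PCI_OFFSET_MASK = (1 &lt;&lt; VFIO_PCI_OFFSET_SHIFT) - 1

# Fixed regions in enum order; VFIO_PCI_NUM_REGIONS equals len(REGIONS).
REGIONS = ("BAR0", "BAR1", "BAR2", "BAR3", "BAR4", "BAR5", "ROM", "CONFIG", "VGA")


def index_to_offset(index):
    """File offset at which region `index` starts."""
    return index &lt;&lt; VFIO_PCI_OFFSET_SHIFT


def split_offset(pos):
    """Split a file offset into (region index, offset inside the region)."""
    return pos &gt;&gt; VFIO_PCI_OFFSET_SHIFT, pos &amp; VFIO_PCI_OFFSET_MASK


def region_name(index):
    """Name the handler that a given region index is dispatched to."""
    if index &lt; len(REGIONS):
        return REGIONS[index]
    return f"device-specific region {index - len(REGIONS)}"


# Reading 4 bytes at offset 0x04 of configuration space:
pos = index_to_offset(REGIONS.index("CONFIG")) + 0x04
index, offset = split_offset(pos)
print(region_name(index), hex(offset))
</code></pre></div></div>
<p>Encoding the region this way lets a single file descriptor expose every region of the device through ordinary <code class="language-plaintext highlighter-rouge">pread</code> and <code class="language-plaintext highlighter-rouge">pwrite</code> calls.</p>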

<p><code class="language-plaintext highlighter-rouge">vfio_pci_bar_rw</code> works in two steps: it first sets up the I/O mapping of the BAR, then performs the actual read or write.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//source: https://github.com/torvalds/linux/blob/master/drivers/vfio/pci/vfio_pci_rdwr.c#L227</span>
<span class="kt">ssize_t</span> <span class="nf">vfio_pci_bar_rw</span><span class="p">(</span><span class="k">struct</span> <span class="nc">vfio_pci_device</span> <span class="o">*</span><span class="n">vdev</span><span class="p">,</span> <span class="kt">char</span> <span class="n">__user</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span>
            <span class="kt">size_t</span> <span class="n">count</span><span class="p">,</span> <span class="n">loff_t</span> <span class="o">*</span><span class="n">ppos</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">iswrite</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">...</span>
    <span class="kt">int</span> <span class="n">ret</span> <span class="o">=</span> <span class="n">vfio_pci_setup_barmap</span><span class="p">(</span><span class="n">vdev</span><span class="p">,</span> <span class="n">bar</span><span class="p">);</span>
    <span class="p">...</span>
    <span class="n">do_io_rw</span><span class="p">(</span><span class="n">vdev</span><span class="p">,</span> <span class="n">res</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">IORESOURCE_MEM</span><span class="p">,</span> <span class="n">io</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">pos</span><span class="p">,</span>
            <span class="n">count</span><span class="p">,</span> <span class="n">x_start</span><span class="p">,</span> <span class="n">x_end</span><span class="p">,</span> <span class="n">iswrite</span><span class="p">);</span>
    <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The data is transferred <strong>between user space and the PCIe device, with the VFIO driver as the intermediary:</strong></p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//source: https://github.com/torvalds/linux/blob/f6e1ea19649216156576aeafa784e3b4cee45549/drivers/vfio/pci/vfio_pci_rdwr.c#L97</span>
<span class="k">static</span> <span class="kt">ssize_t</span> <span class="nf">do_io_rw</span><span class="p">(</span><span class="k">struct</span> <span class="nc">vfio_pci_device</span> <span class="o">*</span><span class="n">vdev</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">test_mem</span><span class="p">,</span>
            <span class="kt">void</span> <span class="n">__iomem</span> <span class="o">*</span><span class="n">io</span><span class="p">,</span> <span class="kt">char</span> <span class="n">__user</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span>
            <span class="n">loff_t</span> <span class="n">off</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">count</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">x_start</span><span class="p">,</span>
            <span class="kt">size_t</span> <span class="n">x_end</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">iswrite</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ssize_t</span> <span class="n">done</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">ret</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">count</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">size_t</span> <span class="n">fillable</span><span class="p">,</span> <span class="n">filled</span><span class="p">;</span>
        <span class="p">...</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">fillable</span> <span class="o">&gt;=</span> <span class="mi">4</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="p">(</span><span class="n">off</span> <span class="o">%</span> <span class="mi">4</span><span class="p">))</span> <span class="p">{</span>
            <span class="n">u32</span> <span class="n">val</span><span class="p">;</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">iswrite</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">if</span> <span class="p">(</span><span class="n">copy_from_user</span><span class="p">(</span><span class="o">&amp;</span><span class="n">val</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
                    <span class="k">return</span> <span class="o">-</span><span class="n">EFAULT</span><span class="p">;</span>
                <span class="n">ret</span> <span class="o">=</span> <span class="n">vfio_pci_iowrite32</span><span class="p">(</span><span class="n">vdev</span><span class="p">,</span> <span class="n">test_mem</span><span class="p">,</span>
                             <span class="n">val</span><span class="p">,</span> <span class="n">io</span> <span class="o">+</span> <span class="n">off</span><span class="p">);</span>
            <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
                <span class="n">ret</span> <span class="o">=</span> <span class="n">vfio_pci_ioread32</span><span class="p">(</span><span class="n">vdev</span><span class="p">,</span> <span class="n">test_mem</span><span class="p">,</span>
                            <span class="o">&amp;</span><span class="n">val</span><span class="p">,</span> <span class="n">io</span> <span class="o">+</span> <span class="n">off</span><span class="p">);</span>
                <span class="k">if</span> <span class="p">(</span><span class="n">copy_to_user</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">val</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
                    <span class="k">return</span> <span class="o">-</span><span class="n">EFAULT</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="n">filled</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="p">...</span>
        <span class="n">count</span> <span class="o">-=</span> <span class="n">filled</span><span class="p">;</span>
        <span class="n">done</span> <span class="o">+=</span> <span class="n">filled</span><span class="p">;</span>
        <span class="n">off</span> <span class="o">+=</span> <span class="n">filled</span><span class="p">;</span>
        <span class="n">buf</span> <span class="o">+=</span> <span class="n">filled</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="n">done</span><span class="p">;</span>
<span class="p">}</span>

</code></pre></div></div>
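<p>The excerpt shows only the 4-byte branch of the loop. The following Python sketch models how a transfer could be broken into naturally aligned accesses, under the assumption that the elided branches fall back to 2-byte and 1-byte accesses in the same manner. It ignores the excluded range described by <code class="language-plaintext highlighter-rouge">x_start</code> and <code class="language-plaintext highlighter-rouge">x_end</code>, and <code class="language-plaintext highlighter-rouge">plan_accesses</code> is a name invented for this example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def plan_accesses(off, count):
    """Split the range [off, off + count) into aligned 4/2/1-byte accesses.

    Returns a list of (offset, width) pairs in the order they are issued.
    """
    plan = []
    while count:
        if count &gt;= 4 and off % 4 == 0:
            width = 4
        elif count &gt;= 2 and off % 2 == 0:
            width = 2
        else:
            width = 1
        plan.append((off, width))
        off += width
        count -= width
    return plan


# An unaligned 8-byte transfer starting at offset 1:
print(plan_accesses(1, 8))  # [(1, 1), (2, 2), (4, 4), (8, 1)]
</code></pre></div></div>
<p>Each access is as wide as alignment and the remaining length allow, which keeps device register accesses naturally aligned while still supporting arbitrary user-space offsets and lengths.</p>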

<h2 id="references">References</h2>
<ol>
  <li>https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine</li>
  <li>https://www.linux-kvm.org/images/5/59/02x03-Neo_Jia_and_Kirti_Wankhede-vGPU_on_KVM-A_VFIO_based_Framework.pdf</li>
  <li>https://www.kernel.org/doc/Documentation/vfio.txt</li>
  <li>https://david942j.blogspot.com/2018/10/note-learning-kvm-implement-your-own.html</li>
</ol>]]></content><author><name></name></author><category term="os" /><category term="virtualization" /><category term="linux" /><category term="gpu" /><category term="virtualization" /><category term="cloud" /><category term="kvm" /><category term="vfio" /><category term="vGPU" /><summary type="html"><![CDATA[Overview and deep-dive of GPU-powered cloud.]]></summary></entry><entry><title type="html">Notes from a Trip to the Sea</title><link href="https://blog.labxq.com/life/2020/12/24/trip-to-san-gregorio-beach.html" rel="alternate" type="text/html" title="Notes from a Trip to the Sea" /><published>2020-12-24T00:00:00-08:00</published><updated>2020-12-24T00:00:00-08:00</updated><id>https://blog.labxq.com/life/2020/12/24/trip-to-san-gregorio-beach</id><content type="html" xml:base="https://blog.labxq.com/life/2020/12/24/trip-to-san-gregorio-beach.html"><![CDATA[<p>The pandemic is still in full swing, and crowded places remain off limits. After thinking it over, the only option left was to go and look at the sea😎</p>

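<p>After leaving home I drove due west. The flatland around San Francisco Bay is ringed by mountains, so reaching the coast means a long stretch of mountain road. The redwood forests on both sides are magnificent, but the twisting road was not much fun to drive, and I nearly got carsick.</p>

<p>Of the beaches near the Bay Area, Half Moon Bay is the most popular. Wanting to avoid the crowds, I deliberately picked another beach nearby, San Gregorio. I had assumed nobody would come to the seaside in the depths of winter, but on arrival I found a thin scattering of visitors dotting the golden sand.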
<img src="/assets/images/Life-san_gregorio_beach_overview.jpg" alt="san_gregorio_beach_overview" /></p>

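<p>Much of the California coast was shaped by wave erosion. Waves have pounded the shore year after year, leaving steep cliffs along the water. San Gregorio Beach has such a cliff too. Its flat top forms a natural viewing platform with a panoramic view of the beach:</p>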
<iframe width="100%" height="400" allowfullscreen="" style="border-style:none;" src="https://cdn.pannellum.org/2.5/pannellum.htm#panorama=https%3A//i.imgur.com/SW78Bkl.jpg&amp;autoLoad=true"></iframe>

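<p>The panorama came out rather poorly, probably because I did not hold the phone steady; a tripod would likely have done better. Still, don’t the trembling light and shadow and the fragmented stitching have a touch of Impressionism and Cubism? Haha, I hope Monet and Picasso don’t rise up and come after me.</p>

<p>I reached the shore just as the tide was coming in. The rumble of the surf rolled in from the vast Pacific, a majestic sound. Flocks of seagulls had gathered on the sand and, like the travelers, mostly stood still facing the ocean, their thoughts seemingly drifting far away with the sound of the tide.</p>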
<video width="100%" controls="" preload="auto">
<source src="/assets/videos/Life-san_gregorio_pacific_ocean.mp4" type="video/mp4" />
</video>

<p><br /></p>

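<p>My biggest wish had been to photograph a sunset over the sea, but when I arrived, dark clouds completely covered the western sky, which was quite a disappointment. After sunset I headed home along California Highway 1, which follows the winding coastline, and on the way a brilliant, fiery afterglow appeared in the sky.</p>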

<!-- Courtesy of embedresponsively.com //-->

<div class="responsive-video-container">
    <iframe src="https://www.youtube-nocookie.com/embed/LH2umyi3BP4" frameborder="0" webkitallowfullscreen="" mozallowfullscreen="" allowfullscreen=""></iframe>
  </div>

<p>Had I stayed on the beach half an hour longer at sunset, I could have captured the whole seaside afterglow, and the trip would have been well worth it. It goes to show that in all things, as the old saying has it, ninety miles is only the halfway point of a hundred-mile journey. In this turbulent, ever-changing age, perseverance is the most precious thing of all.</p>]]></content><author><name></name></author><category term="life" /><category term="travel" /><summary type="html"><![CDATA[A trip to San Gregorio Beach in the San Francisco Bay Area]]></summary></entry><entry><title type="html">A Trip to Windy Hill, Portola Valley, CA</title><link href="https://blog.labxq.com/life/2020/11/17/trip-to-windy-hill.html" rel="alternate" type="text/html" title="A Trip to Windy Hill, Portola Valley, CA" /><published>2020-11-17T00:00:00-08:00</published><updated>2020-11-17T00:00:00-08:00</updated><id>https://blog.labxq.com/life/2020/11/17/trip-to-windy-hill</id><content type="html" xml:base="https://blog.labxq.com/life/2020/11/17/trip-to-windy-hill.html"><![CDATA[<p>People’s way of life has changed a lot since the beginning of the pandemic. With nowhere far away to go, driving around the neighborhood seems to be the only leisure left.</p>

<p>Feeling bored last weekend, I drove out with no destination in mind and happened upon the summit of Windy Hill, a hill in western Silicon Valley with a sweeping view of the entire Bay Area.</p>

<figure class="">
  <img src="/assets/images/Life-windy_hill_summit.jpg" alt="Bay Area Overview" /><figcaption>
      Looking east towards San Francisco Bay from Windy Hill Summit.

    </figcaption></figure>

<p>The whole of San Francisco Bay is surrounded by hills. The eastern half has a dry climate and seldom sees rain. California’s signature color is gold, and anyone will agree once the distant, golden-hued, sparsely vegetated hills come into sight on the highway out of the airport.</p>

<p>The western half, where Windy Hill is located, is quite different. It faces the Pacific Ocean directly. Moisture flows in from the coast continuously, so the climate is humid enough for <strong>California redwood</strong> to survive.</p>

<p>The California redwood is one of the tallest tree species on Earth, reaching up to 100 meters, and its canopy keeps most sunlight from reaching the ground.</p>

<p>In addition, the area is often foggy, mainly because of the cool coastal air and the mountainous terrain. Together these factors create a serene, even gloomy scene inside the redwood forest.</p>

<video id="windy-hill-video" class="video-js vjs-fluid vjs-big-play-centered" controls="" preload="auto" data-setup="{}">
  <source src="/blog-video/windy-hill/1920x1080/video.m3u8" label="1080P" />
  <source src="/blog-video/windy-hill/1280x720/video.m3u8" label="720P" selected="true" />
  <source src="/blog-video/windy-hill/640x360/video.m3u8" label="360P" />
</video>
<script>
  var player = videojs("windy-hill-video");
  player.controlBar.addChild('QualitySelector');
</script>

<p>The video above shows my drive through a redwood forest on Windy Hill. Although I live right next to Redwood City, this was the first time I had seen a California redwood forest.</p>]]></content><author><name></name></author><category term="life" /><category term="travel" /><summary type="html"><![CDATA[Exploration of an open space preserve near central Bay Area.]]></summary></entry><entry><title type="html">Kubernetes Project Exploration, Part 4 - Kubernetes device plugin framework and implementation of NVIDIA device plugin</title><link href="https://blog.labxq.com/cloud/2020/10/25/kubernetes-exploration-part4-k8s-nvidia-device-plugin.html" rel="alternate" type="text/html" title="Kubernetes Project Exploration, Part 4 - Kubernetes device plugin framework and implementation of NVIDIA device plugin" /><published>2020-10-25T00:00:00-07:00</published><updated>2020-10-25T00:00:00-07:00</updated><id>https://blog.labxq.com/cloud/2020/10/25/kubernetes-exploration-part4-k8s-nvidia-device-plugin</id><content type="html" xml:base="https://blog.labxq.com/cloud/2020/10/25/kubernetes-exploration-part4-k8s-nvidia-device-plugin.html"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Kubernetes is largely a resource manager for a cluster. Of all hardware resources, only CPU and memory are supported natively.</p>

<p>Because there are many hardware vendors, it is unrealistic to add each vendor’s device-specific code to the main k8s project.</p>

<p>A device plugin framework was therefore developed to support different devices. Hardware vendors implement a “driver” for their devices, the device plugin, so that the devices can be used in a k8s cluster.</p>

<h2 id="use-of-device-plugin">Use of device plugin</h2>
<p>A device plugin is deployed as a <code class="language-plaintext highlighter-rouge">DaemonSet</code>, so one instance runs on each node. It monitors the devices on its node and collaborates with <code class="language-plaintext highlighter-rouge">kubelet</code> to run containers with the device enabled.</p>

<p>A sample YAML file for the NVIDIA GPU device plugin is as follows:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">apps/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">DaemonSet</span>
<span class="na">metadata</span><span class="pi">:</span>
    <span class="na">name</span><span class="pi">:</span> <span class="s">device-plugin</span>
<span class="na">spec</span><span class="pi">:</span>
    <span class="na">selector</span><span class="pi">:</span>
        <span class="na">matchLabels</span><span class="pi">:</span>
            <span class="na">name</span><span class="pi">:</span> <span class="s">device-plugin</span>
    <span class="na">template</span><span class="pi">:</span>
        <span class="na">metadata</span><span class="pi">:</span>
            <span class="na">labels</span><span class="pi">:</span>
                <span class="na">name</span><span class="pi">:</span> <span class="s">device-plugin</span>
        <span class="na">spec</span><span class="pi">:</span>
            <span class="na">containers</span><span class="pi">:</span>
              <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">device-plugin-ctr</span>
                <span class="na">image</span><span class="pi">:</span> <span class="s">NVIDIA/device-plugin:1.0</span>
                <span class="na">volumeMounts</span><span class="pi">:</span>
                  <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">device-plugin</span>
                    <span class="na">mountPath</span><span class="pi">:</span> <span class="s">/device-plugin</span>
            <span class="na">volumes</span><span class="pi">:</span>
              <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">device-plugin</span>
                <span class="na">hostPath</span><span class="pi">:</span>
                    <span class="na">path</span><span class="pi">:</span> <span class="s">/var/lib/kubelet/device-plugins</span>
</code></pre></div></div>
<p>After the device plugin is deployed, a pod can request devices from the cluster just like any other type of resource. A sample YAML file is as follows:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Pod</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">demo-pod</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">containers</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">demo-container-1</span>
      <span class="na">image</span><span class="pi">:</span> <span class="s">k8s.gcr.io/pause:2.0</span>
      <span class="na">resources</span><span class="pi">:</span>
        <span class="na">limits</span><span class="pi">:</span>
          <span class="na">nvidia.com/gpu</span><span class="pi">:</span> <span class="m">2</span>
</code></pre></div></div>

<h2 id="device-plugin-framework">Device plugin framework</h2>
<p>The device plugin framework is implemented in <code class="language-plaintext highlighter-rouge">kubelet</code> to manage different device plugins. It does the following work:</p>

<ol>
  <li>Register device plugins.</li>
  <li>Discover and monitor the devices each plugin reports.</li>
  <li>Allocate devices to containers.</li>
</ol>

<h3 id="plugin-registration">Plugin registration</h3>
<p>A device plugin must first register with kubelet before it can be used.</p>

<p>Registration is done over gRPC: the plugin framework acts as the gRPC server and the device plugin as the client.</p>

<p>The registration service and its protobuf message are defined as follows:</p>
<div class="language-protobuf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//source: https://github.com/kubernetes/kubernetes/blob/296f7c91bb52cd724ce6d6d120d5d41ed459d677/staging/src/k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1/api.proto#L23</span>
<span class="kd">service</span> <span class="n">Registration</span> <span class="p">{</span>
	<span class="k">rpc</span> <span class="n">Register</span><span class="p">(</span><span class="n">RegisterRequest</span><span class="p">)</span> <span class="k">returns</span> <span class="p">(</span><span class="n">Empty</span><span class="p">)</span> <span class="p">{}</span>
<span class="p">}</span>

<span class="kd">message</span> <span class="nc">RegisterRequest</span> <span class="p">{</span>
	<span class="c1">// Version of the API the Device Plugin was built against</span>
	<span class="kt">string</span> <span class="na">version</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
	<span class="c1">// Name of the unix socket the device plugin is listening on</span>
	<span class="c1">// PATH = path.Join(DevicePluginPath, endpoint)</span>
	<span class="kt">string</span> <span class="na">endpoint</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
	<span class="c1">// Schedulable resource name. As of now it's expected to be a DNS Label</span>
	<span class="kt">string</span> <span class="na">resource_name</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
	<span class="c1">// Options to be communicated with Device Manager</span>
	<span class="n">DevicePluginOptions</span> <span class="na">options</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>From the <code class="language-plaintext highlighter-rouge">RegisterRequest</code> message we can see that a device plugin is described by an API version, an endpoint (the name of the unix socket the plugin listens on), a resource name and some options.</p>
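<p>To make the <code class="language-plaintext highlighter-rouge">endpoint</code> field concrete, here is a minimal, self-contained Go sketch of how a plugin might fill in a registration request and how the socket path follows from it. The struct is a local stand-in for the generated <code class="language-plaintext highlighter-rouge">RegisterRequest</code> type, the socket name <code class="language-plaintext highlighter-rouge">example-gpu.sock</code> is hypothetical, and the directory and version constants are assumed to match the v1beta1 API package rather than being imported from it.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"fmt"
	"path"
)

// Assumed to mirror the constants of the v1beta1 device plugin API.
const (
	devicePluginPath = "/var/lib/kubelet/device-plugins/"
	apiVersion       = "v1beta1"
)

// registerRequest is a local stand-in for the generated RegisterRequest type.
type registerRequest struct {
	Version      string
	Endpoint     string
	ResourceName string
}

// newRegisterRequest builds the request a plugin would send to kubelet.
func newRegisterRequest(socketName, resourceName string) registerRequest {
	return registerRequest{Version: apiVersion, Endpoint: socketName, ResourceName: resourceName}
}

// socketPath resolves the endpoint as the proto comment describes:
// PATH = path.Join(DevicePluginPath, endpoint).
func socketPath(r registerRequest) string {
	return path.Join(devicePluginPath, r.Endpoint)
}

func main() {
	req := newRegisterRequest("example-gpu.sock", "nvidia.com/gpu")
	fmt.Println(socketPath(req))
}
</code></pre></div></div>
<p>A real plugin would send this request to kubelet over gRPC. The sketch only shows how the fields relate to each other.</p>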

<p>On the kubelet side, registration is handled by the device manager in the plugin framework. Its <code class="language-plaintext highlighter-rouge">RegisterPlugin</code> method is as follows:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/4eadf404480e0653e29a9367841080d94ea4017c/pkg/kubelet/cm/devicemanager/manager.go#L312</span>
<span class="k">func</span> <span class="p">(</span><span class="n">m</span> <span class="o">*</span><span class="n">ManagerImpl</span><span class="p">)</span> <span class="n">RegisterPlugin</span><span class="p">(</span><span class="n">pluginName</span> <span class="kt">string</span><span class="p">,</span> <span class="n">endpoint</span> <span class="kt">string</span><span class="p">,</span> <span class="n">versions</span> <span class="p">[]</span><span class="kt">string</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="n">klog</span><span class="o">.</span><span class="n">V</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="o">.</span><span class="n">Infof</span><span class="p">(</span><span class="s">"Registering Plugin %s at endpoint %s"</span><span class="p">,</span> <span class="n">pluginName</span><span class="p">,</span> <span class="n">endpoint</span><span class="p">)</span>

	<span class="n">e</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">newEndpointImpl</span><span class="p">(</span><span class="n">endpoint</span><span class="p">,</span> <span class="n">pluginName</span><span class="p">,</span> <span class="n">m</span><span class="o">.</span><span class="n">callback</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"failed to dial device plugin with socketPath %s: %v"</span><span class="p">,</span> <span class="n">endpoint</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="n">options</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">e</span><span class="o">.</span><span class="n">client</span><span class="o">.</span><span class="n">GetDevicePluginOptions</span><span class="p">(</span><span class="n">context</span><span class="o">.</span><span class="n">Background</span><span class="p">(),</span> <span class="o">&amp;</span><span class="n">pluginapi</span><span class="o">.</span><span class="n">Empty</span><span class="p">{})</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"failed to get device plugin options: %v"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="n">m</span><span class="o">.</span><span class="n">registerEndpoint</span><span class="p">(</span><span class="n">pluginName</span><span class="p">,</span> <span class="n">options</span><span class="p">,</span> <span class="n">e</span><span class="p">)</span>
	<span class="k">go</span> <span class="n">m</span><span class="o">.</span><span class="n">runEndpoint</span><span class="p">(</span><span class="n">pluginName</span><span class="p">,</span> <span class="n">e</span><span class="p">)</span>

	<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In short, an <code class="language-plaintext highlighter-rouge">endpoint</code> is created, registered and run for the plugin.
An <code class="language-plaintext highlighter-rouge">endpoint</code> represents a single registered device plugin. Its definition is as follows:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/abf87c99c63984ba426239e0aed657bf9a8a9054/pkg/kubelet/cm/devicemanager/endpoint.go#L35</span>
<span class="k">type</span> <span class="n">endpoint</span> <span class="k">interface</span> <span class="p">{</span>
	<span class="n">run</span><span class="p">()</span>
	<span class="n">stop</span><span class="p">()</span>
	<span class="n">allocate</span><span class="p">(</span><span class="n">devs</span> <span class="p">[]</span><span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">pluginapi</span><span class="o">.</span><span class="n">AllocateResponse</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span>
	<span class="n">callback</span><span class="p">(</span><span class="n">resourceName</span> <span class="kt">string</span><span class="p">,</span> <span class="n">devices</span> <span class="p">[]</span><span class="n">pluginapi</span><span class="o">.</span><span class="n">Device</span><span class="p">)</span>
	<span class="o">...</span>
<span class="p">}</span>

<span class="k">type</span> <span class="n">endpointImpl</span> <span class="k">struct</span> <span class="p">{</span>
	<span class="n">client</span>     <span class="n">pluginapi</span><span class="o">.</span><span class="n">DevicePluginClient</span>
	<span class="n">clientConn</span> <span class="o">*</span><span class="n">grpc</span><span class="o">.</span><span class="n">ClientConn</span>

	<span class="n">socketPath</span>   <span class="kt">string</span>
	<span class="n">resourceName</span> <span class="kt">string</span>

	<span class="n">mutex</span> <span class="n">sync</span><span class="o">.</span><span class="n">Mutex</span>
	<span class="n">cb</span>    <span class="n">monitorCallback</span>
<span class="p">}</span>
</code></pre></div></div>
<p>It contains the methods and fields the plugin framework needs to communicate with the plugin.</p>

<h3 id="device-discovery">Device discovery</h3>
<p>After a plugin is registered, kubelet starts watching for device changes. The logic, with error handling omitted, is as follows:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/abf87c99c63984ba426239e0aed657bf9a8a9054/pkg/kubelet/cm/devicemanager/endpoint.go#L96</span>
<span class="k">func</span> <span class="p">(</span><span class="n">e</span> <span class="o">*</span><span class="n">endpointImpl</span><span class="p">)</span> <span class="n">run</span><span class="p">()</span> <span class="p">{</span>
	<span class="n">stream</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">e</span><span class="o">.</span><span class="n">client</span><span class="o">.</span><span class="n">ListAndWatch</span><span class="p">(</span><span class="n">context</span><span class="o">.</span><span class="n">Background</span><span class="p">(),</span> <span class="o">&amp;</span><span class="n">pluginapi</span><span class="o">.</span><span class="n">Empty</span><span class="p">{})</span>
	<span class="k">for</span> <span class="p">{</span>
		<span class="n">response</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">stream</span><span class="o">.</span><span class="n">Recv</span><span class="p">()</span>
		<span class="n">devs</span> <span class="o">:=</span> <span class="n">response</span><span class="o">.</span><span class="n">Devices</span>
		<span class="k">var</span> <span class="n">newDevs</span> <span class="p">[]</span><span class="n">pluginapi</span><span class="o">.</span><span class="n">Device</span>
		<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">d</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">devs</span> <span class="p">{</span>
			<span class="n">newDevs</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">newDevs</span><span class="p">,</span> <span class="o">*</span><span class="n">d</span><span class="p">)</span>
		<span class="p">}</span>

		<span class="n">e</span><span class="o">.</span><span class="n">callback</span><span class="p">(</span><span class="n">e</span><span class="o">.</span><span class="n">resourceName</span><span class="p">,</span> <span class="n">newDevs</span><span class="p">)</span>
	<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Devices are watched through the <code class="language-plaintext highlighter-rouge">ListAndWatch</code> gRPC call. Whenever the device list changes, kubelet receives the new list from the stream and invokes a callback to record the device information.</p>
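<p>The receive loop can be tried out without a cluster. The following is a minimal sketch, not kubelet’s code: <code class="language-plaintext highlighter-rouge">fakeStream</code>, <code class="language-plaintext highlighter-rouge">device</code> and <code class="language-plaintext highlighter-rouge">watch</code> are hypothetical stand-ins that imitate a streaming client whose <code class="language-plaintext highlighter-rouge">Recv</code> returns one device list at a time, with the error handling that the excerpt leaves out.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"fmt"
	"io"
)

// device is a local stand-in for pluginapi.Device.
type device struct {
	ID     string
	Health string
}

// fakeStream replays canned device lists, imitating a ListAndWatch stream.
type fakeStream struct {
	responses [][]device
	next      int
}

// Recv returns the next device list, or io.EOF once the stream is drained.
func (s *fakeStream) Recv() ([]device, error) {
	if s.next == len(s.responses) {
		return nil, io.EOF
	}
	devs := s.responses[s.next]
	s.next++
	return devs, nil
}

// watch forwards every received device list to callback until the stream ends.
func watch(s *fakeStream, resourceName string, callback func(string, []device)) error {
	for {
		devs, err := s.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		callback(resourceName, devs)
	}
}

func main() {
	s := new(fakeStream)
	s.responses = [][]device{
		{{ID: "gpu-0", Health: "Healthy"}},
		{{ID: "gpu-0", Health: "Unhealthy"}},
	}
	latest := map[string][]device{}
	err := watch(s, "nvidia.com/gpu", func(name string, devs []device) {
		latest[name] = devs // keep only the most recent snapshot
	})
	fmt.Println(err, latest["nvidia.com/gpu"][0].Health)
}
</code></pre></div></div>
<p>Each message carries the complete device list rather than a diff, so the callback can simply replace its previous snapshot.</p>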

<h3 id="device-allocation">Device allocation</h3>
<p>The main purpose of a device plugin is to have devices allocated to containers.</p>

<p>This is done through the <code class="language-plaintext highlighter-rouge">endpoint::allocate</code> function:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// allocate issues Allocate gRPC call to the device plugin.</span>
<span class="k">func</span> <span class="p">(</span><span class="n">e</span> <span class="o">*</span><span class="n">endpointImpl</span><span class="p">)</span> <span class="n">allocate</span><span class="p">(</span><span class="n">devs</span> <span class="p">[]</span><span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">pluginapi</span><span class="o">.</span><span class="n">AllocateResponse</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">return</span> <span class="n">e</span><span class="o">.</span><span class="n">client</span><span class="o">.</span><span class="n">Allocate</span><span class="p">(</span><span class="n">context</span><span class="o">.</span><span class="n">Background</span><span class="p">(),</span> <span class="o">&amp;</span><span class="n">pluginapi</span><span class="o">.</span><span class="n">AllocateRequest</span><span class="p">{</span>
		<span class="n">ContainerRequests</span><span class="o">:</span> <span class="p">[]</span><span class="o">*</span><span class="n">pluginapi</span><span class="o">.</span><span class="n">ContainerAllocateRequest</span><span class="p">{</span>
			<span class="p">{</span><span class="n">DevicesIDs</span><span class="o">:</span> <span class="n">devs</span><span class="p">},</span>
		<span class="p">},</span>
	<span class="p">})</span>
<span class="p">}</span>
</code></pre></div></div>
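<p>The function sends a gRPC <code class="language-plaintext highlighter-rouge">AllocateRequest</code> that wraps a <code class="language-plaintext highlighter-rouge">ContainerAllocateRequest</code>, and the returned <code class="language-plaintext highlighter-rouge">AllocateResponse</code> carries a <code class="language-plaintext highlighter-rouge">ContainerAllocateResponse</code> for each container request. Both are protobuf messages, defined as follows:</p>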
<div class="language-protobuf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// source: https://github.com/kubernetes/kubernetes/blob/296f7c91bb52cd724ce6d6d120d5d41ed459d677/staging/src/k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1/api.proto#L162</span>
<span class="kd">message</span> <span class="nc">ContainerAllocateRequest</span> <span class="p">{</span>
	<span class="k">repeated</span> <span class="kt">string</span> <span class="na">devicesIDs</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="kd">message</span> <span class="nc">ContainerAllocateResponse</span> <span class="p">{</span>
	<span class="c1">// List of environment variable to be set in the container to access one of more devices.</span>
	<span class="n">map</span><span class="o">&lt;</span><span class="kt">string</span><span class="p">,</span> <span class="kt">string</span><span class="err">&gt;</span> <span class="na">envs</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
	<span class="c1">// Mounts for the container.</span>
	<span class="k">repeated</span> <span class="n">Mount</span> <span class="na">mounts</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
	<span class="c1">// Devices for the container.</span>
	<span class="k">repeated</span> <span class="n">DeviceSpec</span> <span class="na">devices</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
	<span class="c1">// Container annotations to pass to the container runtime</span>
	<span class="n">map</span><span class="o">&lt;</span><span class="kt">string</span><span class="p">,</span> <span class="kt">string</span><span class="err">&gt;</span> <span class="na">annotations</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The request message is simple: when a container needs devices, kubelet sends the IDs of the chosen devices to the plugin.</p>

<p>The response message indicates <strong>how to use</strong> these devices in the container.</p>

<p><code class="language-plaintext highlighter-rouge">DeviceSpec</code> specifies the core attributes of a device, including its host path, container path and permissions.</p>

<p><code class="language-plaintext highlighter-rouge">mounts</code> is needed so that device drivers and libraries can be mounted into the container.</p>

<p>Environment variables and annotations can also carry information needed to access the device in the container.</p>
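<p>As an illustration of how these response fields fit together, here is a minimal sketch of a plugin-side helper that fills in a per-container response. The types are local stand-ins for the generated protobuf types, <code class="language-plaintext highlighter-rouge">buildResponse</code> and the <code class="language-plaintext highlighter-rouge">hostPaths</code> lookup table are hypothetical, and the use of the <code class="language-plaintext highlighter-rouge">NVIDIA_VISIBLE_DEVICES</code> variable is an assumption about how a container runtime might be told which GPUs to expose.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"fmt"
	"strings"
)

// deviceSpec and containerAllocateResponse are local stand-ins for the
// generated protobuf types; only the fields used here are included.
type deviceSpec struct {
	ContainerPath string
	HostPath      string
	Permissions   string
}

type containerAllocateResponse struct {
	Envs    map[string]string
	Devices []deviceSpec
}

// buildResponse describes how a container could reach the requested devices.
// hostPaths is a hypothetical lookup table from device ID to device node.
func buildResponse(ids []string, hostPaths map[string]string) (containerAllocateResponse, error) {
	resp := containerAllocateResponse{Envs: map[string]string{}}
	for _, id := range ids {
		p, ok := hostPaths[id]
		if !ok {
			return resp, fmt.Errorf("unknown device: %s", id)
		}
		// Expose the device node at the same path, readable and writable.
		resp.Devices = append(resp.Devices, deviceSpec{ContainerPath: p, HostPath: p, Permissions: "rw"})
	}
	// Assumed variable name; a runtime hook could read it to select devices.
	resp.Envs["NVIDIA_VISIBLE_DEVICES"] = strings.Join(ids, ",")
	return resp, nil
}

func main() {
	paths := map[string]string{"gpu-0": "/dev/nvidia0", "gpu-1": "/dev/nvidia1"}
	resp, err := buildResponse([]string{"gpu-0", "gpu-1"}, paths)
	fmt.Println(resp.Envs, len(resp.Devices), err)
}
</code></pre></div></div>
<p>A plugin does not have to use every field. It can return only environment variables, only device specs, or any combination that lets the container reach the device.</p>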

<h2 id="nvidia-device-plugin">NVIDIA device plugin</h2>
<h3 id="introduction-1">Introduction</h3>
<p>NVIDIA’s k8s device plugin is crucial for bringing GPU workloads to Kubernetes. It serves a similar purpose to <code class="language-plaintext highlighter-rouge">nvidia-docker</code>.</p>

<h3 id="what-nvidia-device-plugin-does">What nvidia-device-plugin does</h3>
<p>In general, nvidia-device-plugin is a GPU device manager for a Kubernetes cluster. It will:</p>
<ol>
  <li>respond to gRPC requests from the plugin framework;</li>
  <li>monitor all GPUs on the node;</li>
  <li>return device information for allocation.</li>
</ol>

<h3 id="project-architecture">Project Architecture</h3>
<h4 id="servergo">server.go</h4>
<p>nvidia-device-plugin is essentially a gRPC server. <code class="language-plaintext highlighter-rouge">server.go</code> implements all RPC functions and the server-related logic.</p>
<h4 id="nvidiago">nvidia.go</h4>
<p>GPU operations are implemented in this file and used by <code class="language-plaintext highlighter-rouge">server.go</code>.</p>
<h4 id="gpu-monitoring-tools">gpu-monitoring-tools</h4>
<p>This project provides Go bindings for the lower-level management libraries and is used by the <code class="language-plaintext highlighter-rouge">nvidia-device-plugin</code> project.</p>
<h4 id="nvmlnvidia-management-library">NVML(NVIDIA Management Library)</h4>
<p>This library provides a C API for monitoring and managing NVIDIA GPU devices. It is used by <code class="language-plaintext highlighter-rouge">nvidia-smi</code> and by other libraries, including <code class="language-plaintext highlighter-rouge">gpu-monitoring-tools</code>.</p>

<h3 id="listandwatch">ListAndWatch</h3>
<p>This function monitors the GPU devices on the node.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://gitlab.com/nvidia/kubernetes/device-plugin/blob/4167bfd7fdfdbec6a5378af3589650714cf2ab3f/server.go#L218</span>
<span class="k">func</span> <span class="p">(</span><span class="n">m</span> <span class="o">*</span><span class="n">NvidiaDevicePlugin</span><span class="p">)</span> <span class="n">ListAndWatch</span><span class="p">(</span><span class="n">e</span> <span class="o">*</span><span class="n">pluginapi</span><span class="o">.</span><span class="n">Empty</span><span class="p">,</span> <span class="n">s</span> <span class="n">pluginapi</span><span class="o">.</span><span class="n">DevicePlugin_ListAndWatchServer</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="n">s</span><span class="o">.</span><span class="n">Send</span><span class="p">(</span><span class="o">&amp;</span><span class="n">pluginapi</span><span class="o">.</span><span class="n">ListAndWatchResponse</span><span class="p">{</span><span class="n">Devices</span><span class="o">:</span> <span class="n">m</span><span class="o">.</span><span class="n">apiDevices</span><span class="p">()})</span>
	<span class="k">for</span> <span class="p">{</span>
		<span class="k">select</span> <span class="p">{</span>
		<span class="k">case</span> <span class="o">&lt;-</span><span class="n">m</span><span class="o">.</span><span class="n">stop</span><span class="o">:</span>
			<span class="k">return</span> <span class="no">nil</span>
		<span class="k">case</span> <span class="n">d</span> <span class="o">:=</span> <span class="o">&lt;-</span><span class="n">m</span><span class="o">.</span><span class="n">health</span><span class="o">:</span>
			<span class="n">d</span><span class="o">.</span><span class="n">Health</span> <span class="o">=</span> <span class="n">pluginapi</span><span class="o">.</span><span class="n">Unhealthy</span>
			<span class="n">log</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"'%s' device marked unhealthy: %s"</span><span class="p">,</span> <span class="n">m</span><span class="o">.</span><span class="n">resourceName</span><span class="p">,</span> <span class="n">d</span><span class="o">.</span><span class="n">ID</span><span class="p">)</span>
			<span class="n">s</span><span class="o">.</span><span class="n">Send</span><span class="p">(</span><span class="o">&amp;</span><span class="n">pluginapi</span><span class="o">.</span><span class="n">ListAndWatchResponse</span><span class="p">{</span><span class="n">Devices</span><span class="o">:</span> <span class="n">m</span><span class="o">.</span><span class="n">apiDevices</span><span class="p">()})</span>
		<span class="p">}</span>
	<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
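<p>The logic is straightforward. The function first sends the full device list to kubelet, then waits on the <code class="language-plaintext highlighter-rouge">health</code> channel. Whenever a device is reported there, the function marks it unhealthy and sends the updated list to kubelet over gRPC.</p>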

<p>The health check is as follows:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://gitlab.com/nvidia/kubernetes/device-plugin/blob/4167bfd7fdfdbec6a5378af3589650714cf2ab3f/nvidia.go#L159</span>
<span class="k">func</span> <span class="n">checkHealth</span><span class="p">(</span><span class="n">stop</span> <span class="o">&lt;-</span><span class="k">chan</span> <span class="k">interface</span><span class="p">{},</span> <span class="n">devices</span> <span class="p">[]</span><span class="o">*</span><span class="n">Device</span><span class="p">,</span> <span class="n">unhealthy</span> <span class="k">chan</span><span class="o">&lt;-</span> <span class="o">*</span><span class="n">Device</span><span class="p">)</span> <span class="p">{</span>
	<span class="o">...</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">d</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">devices</span> <span class="p">{</span>
		<span class="n">gpu</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">nvml</span><span class="o">.</span><span class="n">ParseMigDeviceUUID</span><span class="p">(</span><span class="n">d</span><span class="o">.</span><span class="n">ID</span><span class="p">)</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="n">gpu</span> <span class="o">=</span> <span class="n">d</span><span class="o">.</span><span class="n">ID</span>
		<span class="p">}</span>
		<span class="n">err</span> <span class="o">=</span> <span class="n">nvml</span><span class="o">.</span><span class="n">RegisterEventForDevice</span><span class="p">(</span><span class="n">eventSet</span><span class="p">,</span> <span class="n">nvml</span><span class="o">.</span><span class="n">XidCriticalError</span><span class="p">,</span> <span class="n">gpu</span><span class="p">)</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="o">&amp;&amp;</span> <span class="n">strings</span><span class="o">.</span><span class="n">HasSuffix</span><span class="p">(</span><span class="n">err</span><span class="o">.</span><span class="n">Error</span><span class="p">(),</span> <span class="s">"Not Supported"</span><span class="p">)</span> <span class="p">{</span>
			<span class="n">log</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"Warning: %s is too old to support healthchecking: %s. Marking it unhealthy."</span><span class="p">,</span> <span class="n">d</span><span class="o">.</span><span class="n">ID</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
			<span class="n">unhealthy</span> <span class="o">&lt;-</span> <span class="n">d</span>
			<span class="k">continue</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="k">for</span> <span class="p">{</span>
		<span class="n">e</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">nvml</span><span class="o">.</span><span class="n">WaitForEvent</span><span class="p">(</span><span class="n">eventSet</span><span class="p">,</span> <span class="m">5000</span><span class="p">)</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="o">&amp;&amp;</span> <span class="n">e</span><span class="o">.</span><span class="n">Etype</span> <span class="o">!=</span> <span class="n">nvml</span><span class="o">.</span><span class="n">XidCriticalError</span> <span class="p">{</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="o">...</span>
		<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">d</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">devices</span> <span class="p">{</span>
			<span class="c">// Please see https://github.com/NVIDIA/gpu-monitoring-tools/blob/148415f505c96052cb3b7fdf443b34ac853139ec/bindings/go/nvml/nvml.h#L1424</span>
			<span class="c">// for the rationale why gi and ci can be set as such when the UUID is a full GPU UUID and not a MIG device UUID.</span>
			<span class="n">gpu</span><span class="p">,</span> <span class="n">gi</span><span class="p">,</span> <span class="n">ci</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">nvml</span><span class="o">.</span><span class="n">ParseMigDeviceUUID</span><span class="p">(</span><span class="n">d</span><span class="o">.</span><span class="n">ID</span><span class="p">)</span>
			<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
				<span class="n">gpu</span> <span class="o">=</span> <span class="n">d</span><span class="o">.</span><span class="n">ID</span>
				<span class="n">gi</span> <span class="o">=</span> <span class="m">0xFFFFFFFF</span>
				<span class="n">ci</span> <span class="o">=</span> <span class="m">0xFFFFFFFF</span>
			<span class="p">}</span>

			<span class="k">if</span> <span class="n">gpu</span> <span class="o">==</span> <span class="o">*</span><span class="n">e</span><span class="o">.</span><span class="n">UUID</span> <span class="o">&amp;&amp;</span> <span class="n">gi</span> <span class="o">==</span> <span class="o">*</span><span class="n">e</span><span class="o">.</span><span class="n">GpuInstanceId</span> <span class="o">&amp;&amp;</span> <span class="n">ci</span> <span class="o">==</span> <span class="o">*</span><span class="n">e</span><span class="o">.</span><span class="n">ComputeInstanceId</span> <span class="p">{</span>
				<span class="n">log</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"XidCriticalError: Xid=%d on Device=%s, the device will go unhealthy."</span><span class="p">,</span> <span class="n">e</span><span class="o">.</span><span class="n">Edata</span><span class="p">,</span> <span class="n">d</span><span class="o">.</span><span class="n">ID</span><span class="p">)</span>
				<span class="n">unhealthy</span> <span class="o">&lt;-</span> <span class="n">d</span>
			<span class="p">}</span>
		<span class="p">}</span>
	<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>GPU events are registered and monitored through the NVML bindings. When a critical Xid error event arrives for a GPU, the matching device is reported back to <code class="language-plaintext highlighter-rouge">ListAndWatch</code> through the <code class="language-plaintext highlighter-rouge">unhealthy</code> channel.</p>

<h3 id="allocate">Allocate</h3>
<p>Let’s see what information is sent back to kubelet so that a container can use a GPU device:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://gitlab.com/nvidia/kubernetes/device-plugin/blob/4167bfd7fdfdbec6a5378af3589650714cf2ab3f/server.go#L265</span>
<span class="k">func</span> <span class="p">(</span><span class="n">m</span> <span class="o">*</span><span class="n">NvidiaDevicePlugin</span><span class="p">)</span> <span class="n">Allocate</span><span class="p">(</span><span class="n">ctx</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">reqs</span> <span class="o">*</span><span class="n">pluginapi</span><span class="o">.</span><span class="n">AllocateRequest</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">pluginapi</span><span class="o">.</span><span class="n">AllocateResponse</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">responses</span> <span class="o">:=</span> <span class="n">pluginapi</span><span class="o">.</span><span class="n">AllocateResponse</span><span class="p">{}</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">req</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">reqs</span><span class="o">.</span><span class="n">ContainerRequests</span> <span class="p">{</span>
		<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">id</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">req</span><span class="o">.</span><span class="n">DevicesIDs</span> <span class="p">{</span>
			<span class="k">if</span> <span class="o">!</span><span class="n">m</span><span class="o">.</span><span class="n">deviceExists</span><span class="p">(</span><span class="n">id</span><span class="p">)</span> <span class="p">{</span>
				<span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"invalid allocation request for '%s': unknown device: %s"</span><span class="p">,</span> <span class="n">m</span><span class="o">.</span><span class="n">resourceName</span><span class="p">,</span> <span class="n">id</span><span class="p">)</span>
			<span class="p">}</span>
		<span class="p">}</span>
		<span class="n">response</span> <span class="o">:=</span> <span class="n">pluginapi</span><span class="o">.</span><span class="n">ContainerAllocateResponse</span><span class="p">{}</span>
		<span class="k">if</span> <span class="o">*</span><span class="n">deviceListStrategyFlag</span> <span class="o">==</span> <span class="n">DeviceListStrategyVolumeMounts</span> <span class="p">{</span>
			<span class="n">response</span><span class="o">.</span><span class="n">Envs</span> <span class="o">=</span> <span class="n">m</span><span class="o">.</span><span class="n">apiEnvs</span><span class="p">(</span><span class="n">m</span><span class="o">.</span><span class="n">deviceListEnvvar</span><span class="p">,</span> <span class="p">[]</span><span class="kt">string</span><span class="p">{</span><span class="n">deviceListAsVolumeMountsContainerPathRoot</span><span class="p">})</span>
			<span class="n">response</span><span class="o">.</span><span class="n">Mounts</span> <span class="o">=</span> <span class="n">m</span><span class="o">.</span><span class="n">apiMounts</span><span class="p">(</span><span class="n">req</span><span class="o">.</span><span class="n">DevicesIDs</span><span class="p">)</span>
		<span class="p">}</span>
		<span class="k">if</span> <span class="o">*</span><span class="n">passDeviceSpecs</span> <span class="p">{</span>
			<span class="n">response</span><span class="o">.</span><span class="n">Devices</span> <span class="o">=</span> <span class="n">m</span><span class="o">.</span><span class="n">apiDeviceSpecs</span><span class="p">(</span><span class="n">req</span><span class="o">.</span><span class="n">DevicesIDs</span><span class="p">)</span>
		<span class="p">}</span>
		<span class="n">responses</span><span class="o">.</span><span class="n">ContainerResponses</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">responses</span><span class="o">.</span><span class="n">ContainerResponses</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">response</span><span class="p">)</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="o">&amp;</span><span class="n">responses</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
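<p>As the code indicates, the response consists of <code class="language-plaintext highlighter-rouge">Envs</code>, <code class="language-plaintext highlighter-rouge">Mounts</code> and <code class="language-plaintext highlighter-rouge">Devices</code> (a list of <code class="language-plaintext highlighter-rouge">DeviceSpec</code>). Which of these fields are filled in depends on the plugin's flags.</p>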
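<p>To make the shape of such a response concrete, here is a minimal sketch with local stand-in types. It does not use the real <code class="language-plaintext highlighter-rouge">pluginapi</code> package. The <code class="language-plaintext highlighter-rouge">buildResponse</code> helper, the use of <code class="language-plaintext highlighter-rouge">NVIDIA_VISIBLE_DEVICES</code> as the environment variable name and the shortened path list are assumptions made for illustration.</p>

```go
package main

import (
	"fmt"
	"strings"
)

// DeviceSpec and ContainerAllocateResponse are simplified local stand-ins for
// the device plugin API types; only the fields used here are modelled.
type DeviceSpec struct {
	ContainerPath string
	HostPath      string
	Permissions   string
}

type ContainerAllocateResponse struct {
	Envs    map[string]string
	Devices []*DeviceSpec
}

// visibleDevicesEnv is assumed to be the variable the container runtime reads.
const visibleDevicesEnv = "NVIDIA_VISIBLE_DEVICES"

// buildResponse is a hypothetical helper: it always exports the requested
// device IDs through an environment variable, and attaches device nodes only
// when passDeviceSpecs is set.
func buildResponse(ids []string, passDeviceSpecs bool) *ContainerAllocateResponse {
	resp := &ContainerAllocateResponse{
		Envs: map[string]string{visibleDevicesEnv: strings.Join(ids, ",")},
	}
	if passDeviceSpecs {
		for _, p := range []string{"/dev/nvidiactl", "/dev/nvidia-uvm"} {
			resp.Devices = append(resp.Devices, &DeviceSpec{
				ContainerPath: p,
				HostPath:      p,
				Permissions:   "rw",
			})
		}
	}
	return resp
}

func main() {
	resp := buildResponse([]string{"GPU-0", "GPU-1"}, true)
	fmt.Println(resp.Envs[visibleDevicesEnv], len(resp.Devices))
}
```

<p>The point of the sketch is that the plugin never starts the container itself. It only describes what the container needs, and kubelet forwards that description to the container runtime.</p>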

<p>Here is how the list of <code class="language-plaintext highlighter-rouge">DeviceSpec</code> is built:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://gitlab.com/nvidia/kubernetes/device-plugin/blob/4167bfd7fdfdbec6a5378af3589650714cf2ab3f/server.go#L351</span>
<span class="k">func</span> <span class="p">(</span><span class="n">m</span> <span class="o">*</span><span class="n">NvidiaDevicePlugin</span><span class="p">)</span> <span class="n">apiDeviceSpecs</span><span class="p">(</span><span class="n">filter</span> <span class="p">[]</span><span class="kt">string</span><span class="p">)</span> <span class="p">[]</span><span class="o">*</span><span class="n">pluginapi</span><span class="o">.</span><span class="n">DeviceSpec</span> <span class="p">{</span>
	<span class="k">var</span> <span class="n">specs</span> <span class="p">[]</span><span class="o">*</span><span class="n">pluginapi</span><span class="o">.</span><span class="n">DeviceSpec</span>

	<span class="n">paths</span> <span class="o">:=</span> <span class="p">[]</span><span class="kt">string</span><span class="p">{</span>
		<span class="s">"/dev/nvidiactl"</span><span class="p">,</span>
		<span class="s">"/dev/nvidia-uvm"</span><span class="p">,</span>
		<span class="s">"/dev/nvidia-uvm-tools"</span><span class="p">,</span>
		<span class="s">"/dev/nvidia-modeset"</span><span class="p">,</span>
	<span class="p">}</span>

	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">p</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">paths</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">os</span><span class="o">.</span><span class="n">Stat</span><span class="p">(</span><span class="n">p</span><span class="p">);</span> <span class="n">err</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="n">spec</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">pluginapi</span><span class="o">.</span><span class="n">DeviceSpec</span><span class="p">{</span>
				<span class="n">ContainerPath</span><span class="o">:</span> <span class="n">p</span><span class="p">,</span>
				<span class="n">HostPath</span><span class="o">:</span>      <span class="n">p</span><span class="p">,</span>
				<span class="n">Permissions</span><span class="o">:</span>   <span class="s">"rw"</span><span class="p">,</span>
			<span class="p">}</span>
			<span class="n">specs</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">specs</span><span class="p">,</span> <span class="n">spec</span><span class="p">)</span>
		<span class="p">}</span>
	<span class="p">}</span>
    <span class="o">...</span>
	<span class="k">return</span> <span class="n">specs</span>
<span class="p">}</span>
</code></pre></div></div>
<p>We can see that all NVIDIA related device paths have been attached to response. Kubelet will mount all these paths to enable GPU inside container.</p>]]></content><author><name></name></author><category term="cloud" /><category term="kubernetes" /><category term="golang" /><category term="container" /><category term="gpu" /><summary type="html"><![CDATA[Mechanism and source code analysis of Kubernetes device plugin.]]></summary></entry><entry><title type="html">Kubernetes Project Exploration, Part 3 - kubelet mechanism and source code analysis</title><link href="https://blog.labxq.com/cloud/2020/10/18/kubernetes-exploration-part3-kubelet.html" rel="alternate" type="text/html" title="Kubernetes Project Exploration, Part 3 - kubelet mechanism and source code analysis" /><published>2020-10-18T00:00:00-07:00</published><updated>2020-10-18T00:00:00-07:00</updated><id>https://blog.labxq.com/cloud/2020/10/18/kubernetes-exploration-part3-kubelet</id><content type="html" xml:base="https://blog.labxq.com/cloud/2020/10/18/kubernetes-exploration-part3-kubelet.html"><![CDATA[<h2 id="overview">Overview</h2>
<p>Kubelet is a <strong>node agent</strong> running on each Kubernetes node. It is basically a pod manager controlling pods running on the node.</p>

<p>A PodSpec YAML file is provided by the client and received through the API server; kubelet then creates or updates the pod on the node it was scheduled to.</p>

<p>Kubelet acts as a gRPC client of the container runtime and instructs the runtime to perform the actual container operations.</p>

<p>The following diagram shows the structure of kubelet:</p>

<p><img src="/assets/images/kubelet_structure.svg" alt="kubelet structure" /></p>

<h2 id="project-architecture">Project Architecture</h2>
<p>Kubelet is one of the core components of Kubernetes. Its source code sits directly inside the main project, in the <a href="https://github.com/kubernetes/kubernetes/tree/master/pkg/kubelet">pkg/kubelet</a> folder.</p>

<p>The following are the important sub-folders of the kubelet project:</p>
<h3 id="server">server</h3>
<p>Kubelet exposes an HTTP API that the control plane calls, for example to fetch logs and metrics, so kubelet itself is an HTTP server.</p>
<h3 id="config">config</h3>
<p>The main input of kubelet is the “config” of a Pod. The <code class="language-plaintext highlighter-rouge">config</code> folder contains object-oriented abstractions of the different types of config.</p>
<h3 id="podimagesnetworkvolumemanager">pod/images/network/volumemanager</h3>
<p>These folders contain the code that manages different aspects of a Pod. Together they determine how a Pod is run.</p>
<h3 id="cm">cm</h3>
<p>One of the main goals of Kubernetes is resource management. Kubelet must be able to allocate the resources requested in the PodSpec.</p>

<p>In the kubelet project this is implemented in the <code class="language-plaintext highlighter-rouge">cm</code> (container manager) folder. Resources such as CPU, memory and devices are managed by <code class="language-plaintext highlighter-rouge">ContainerManager</code>.</p>
<h3 id="containerkuberuntime">container/kuberuntime</h3>
<p>These two folders together serve as the interface to the container runtime.</p>

<p>To operate on a container, kubelet calls functions in the <code class="language-plaintext highlighter-rouge">container</code> folder, which in turn call functions in <code class="language-plaintext highlighter-rouge">kuberuntime</code>.</p>

<p>The <code class="language-plaintext highlighter-rouge">kuberuntime</code> folder contains the gRPC-based code that communicates with the underlying container runtime, such as <code class="language-plaintext highlighter-rouge">containerd</code>.</p>
<h3 id="proberstatsmetricslogs">prober/stats/metrics/logs</h3>
<p>Kubelet provides various ways to increase the observability of the pods it manages. These folders contain the code that collects information which is later fetched by the control plane.</p>

<h2 id="process-of-creating-a-pod">Process of creating a Pod</h2>
<p>One of the typical use cases of Kubernetes is to start a Pod. The steps are:</p>
<ol>
  <li>The user uses kubectl to ask the API server to create a resource;</li>
  <li>The scheduler assigns the Pod to a suitable node;</li>
  <li>Kubelet starts the Pod on that node.</li>
</ol>

<p>Kubelet is responsible for step 3 of pod creation. Let’s walk through the code to see how it is done.</p>
<h3 id="start-kubelet">start kubelet</h3>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/75242fce7aa8a8f9e703b8602587900ca5aaf937/cmd/kubelet/app/server.go#L1178</span>
<span class="k">func</span> <span class="n">startKubelet</span><span class="p">(</span><span class="n">k</span> <span class="n">kubelet</span><span class="o">.</span><span class="n">Bootstrap</span><span class="p">,</span> <span class="n">podCfg</span> <span class="o">*</span><span class="n">config</span><span class="o">.</span><span class="n">PodConfig</span><span class="p">,</span> <span class="n">kubeCfg</span> <span class="o">*</span><span class="n">kubeletconfiginternal</span><span class="o">.</span><span class="n">KubeletConfiguration</span><span class="p">,</span> <span class="n">kubeDeps</span> <span class="o">*</span><span class="n">kubelet</span><span class="o">.</span><span class="n">Dependencies</span><span class="p">,</span> <span class="n">enableCAdvisorJSONEndpoints</span><span class="p">,</span> <span class="n">enableServer</span> <span class="kt">bool</span><span class="p">)</span> <span class="p">{</span>
	<span class="c">// start the kubelet</span>
	<span class="k">go</span> <span class="n">k</span><span class="o">.</span><span class="n">Run</span><span class="p">(</span><span class="n">podCfg</span><span class="o">.</span><span class="n">Updates</span><span class="p">())</span>

	<span class="c">// start the kubelet server</span>
	<span class="k">if</span> <span class="n">enableServer</span> <span class="p">{</span>
		<span class="k">go</span> <span class="n">k</span><span class="o">.</span><span class="n">ListenAndServe</span><span class="p">(</span><span class="n">net</span><span class="o">.</span><span class="n">ParseIP</span><span class="p">(</span><span class="n">kubeCfg</span><span class="o">.</span><span class="n">Address</span><span class="p">),</span> <span class="kt">uint</span><span class="p">(</span><span class="n">kubeCfg</span><span class="o">.</span><span class="n">Port</span><span class="p">),</span> <span class="n">kubeDeps</span><span class="o">.</span><span class="n">TLSOptions</span><span class="p">,</span> <span class="n">kubeDeps</span><span class="o">.</span><span class="n">Auth</span><span class="p">,</span>
			<span class="n">enableCAdvisorJSONEndpoints</span><span class="p">,</span> <span class="n">kubeCfg</span><span class="o">.</span><span class="n">EnableDebuggingHandlers</span><span class="p">,</span> <span class="n">kubeCfg</span><span class="o">.</span><span class="n">EnableContentionProfiling</span><span class="p">,</span> <span class="n">kubeCfg</span><span class="o">.</span><span class="n">EnableSystemLogHandler</span><span class="p">)</span>

	<span class="p">}</span>
	<span class="o">...</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The main kubelet logic is clear. There are two things to do:</p>
<ol>
  <li>Start the kubelet server, which listens for incoming requests;</li>
  <li>Start a goroutine that handles pod updates.</li>
</ol>

<p>The following is the logic that handles pod updates:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/75242fce7aa8a8f9e703b8602587900ca5aaf937/pkg/kubelet/kubelet.go#L1306</span>
<span class="k">func</span> <span class="p">(</span><span class="n">kl</span> <span class="o">*</span><span class="n">Kubelet</span><span class="p">)</span> <span class="n">Run</span><span class="p">(</span><span class="n">updates</span> <span class="o">&lt;-</span><span class="k">chan</span> <span class="n">kubetypes</span><span class="o">.</span><span class="n">PodUpdate</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">if</span> <span class="n">kl</span><span class="o">.</span><span class="n">logServer</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="n">kl</span><span class="o">.</span><span class="n">logServer</span> <span class="o">=</span> <span class="n">http</span><span class="o">.</span><span class="n">StripPrefix</span><span class="p">(</span><span class="s">"/logs/"</span><span class="p">,</span> <span class="n">http</span><span class="o">.</span><span class="n">FileServer</span><span class="p">(</span><span class="n">http</span><span class="o">.</span><span class="n">Dir</span><span class="p">(</span><span class="s">"/var/log/"</span><span class="p">)))</span>
	<span class="p">}</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">kl</span><span class="o">.</span><span class="n">initializeModules</span><span class="p">();</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="n">kl</span><span class="o">.</span><span class="n">recorder</span><span class="o">.</span><span class="n">Eventf</span><span class="p">(</span><span class="n">kl</span><span class="o">.</span><span class="n">nodeRef</span><span class="p">,</span> <span class="n">v1</span><span class="o">.</span><span class="n">EventTypeWarning</span><span class="p">,</span> <span class="n">events</span><span class="o">.</span><span class="n">KubeletSetupFailed</span><span class="p">,</span> <span class="n">err</span><span class="o">.</span><span class="n">Error</span><span class="p">())</span>
		<span class="n">klog</span><span class="o">.</span><span class="n">Fatal</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="c">// Start volume manager</span>
	<span class="k">go</span> <span class="n">kl</span><span class="o">.</span><span class="n">volumeManager</span><span class="o">.</span><span class="n">Run</span><span class="p">(</span><span class="n">kl</span><span class="o">.</span><span class="n">sourcesReady</span><span class="p">,</span> <span class="n">wait</span><span class="o">.</span><span class="n">NeverStop</span><span class="p">)</span>

	<span class="k">if</span> <span class="n">kl</span><span class="o">.</span><span class="n">kubeClient</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="c">// Start syncing node status immediately, this may set up things the runtime needs to run.</span>
		<span class="k">go</span> <span class="n">wait</span><span class="o">.</span><span class="n">Until</span><span class="p">(</span><span class="n">kl</span><span class="o">.</span><span class="n">syncNodeStatus</span><span class="p">,</span> <span class="n">kl</span><span class="o">.</span><span class="n">nodeStatusUpdateFrequency</span><span class="p">,</span> <span class="n">wait</span><span class="o">.</span><span class="n">NeverStop</span><span class="p">)</span>
		<span class="k">go</span> <span class="n">kl</span><span class="o">.</span><span class="n">fastStatusUpdateOnce</span><span class="p">()</span>

		<span class="c">// start syncing lease</span>
		<span class="k">go</span> <span class="n">kl</span><span class="o">.</span><span class="n">nodeLeaseController</span><span class="o">.</span><span class="n">Run</span><span class="p">(</span><span class="n">wait</span><span class="o">.</span><span class="n">NeverStop</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="c">// Start a goroutine responsible for killing pods (that are not properly</span>
	<span class="c">// handled by pod workers).</span>
	<span class="k">go</span> <span class="n">wait</span><span class="o">.</span><span class="n">Until</span><span class="p">(</span><span class="n">kl</span><span class="o">.</span><span class="n">podKiller</span><span class="o">.</span><span class="n">PerformPodKillingWork</span><span class="p">,</span> <span class="m">1</span><span class="o">*</span><span class="n">time</span><span class="o">.</span><span class="n">Second</span><span class="p">,</span> <span class="n">wait</span><span class="o">.</span><span class="n">NeverStop</span><span class="p">)</span>

	<span class="c">// Start component sync loops.</span>
	<span class="n">kl</span><span class="o">.</span><span class="n">statusManager</span><span class="o">.</span><span class="n">Start</span><span class="p">()</span>
	<span class="n">kl</span><span class="o">.</span><span class="n">probeManager</span><span class="o">.</span><span class="n">Start</span><span class="p">()</span>

	<span class="c">// Start syncing RuntimeClasses if enabled.</span>
	<span class="k">if</span> <span class="n">kl</span><span class="o">.</span><span class="n">runtimeClassManager</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="n">kl</span><span class="o">.</span><span class="n">runtimeClassManager</span><span class="o">.</span><span class="n">Start</span><span class="p">(</span><span class="n">wait</span><span class="o">.</span><span class="n">NeverStop</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="c">// Start the pod lifecycle event generator.</span>
	<span class="n">kl</span><span class="o">.</span><span class="n">pleg</span><span class="o">.</span><span class="n">Start</span><span class="p">()</span>
	<span class="n">kl</span><span class="o">.</span><span class="n">syncLoop</span><span class="p">(</span><span class="n">updates</span><span class="p">,</span> <span class="n">kl</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
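<p>We can see that <code class="language-plaintext highlighter-rouge">Run</code> starts all kinds of “managers” that handle concerns such as logs, volumes, node status, pod status, probes and pod lifecycle events, and then enters <code class="language-plaintext highlighter-rouge">syncLoop</code> to process pod updates.</p>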

<h3 id="podconfig">PodConfig</h3>
<p>The specifications of pods are tracked in a struct <code class="language-plaintext highlighter-rouge">PodConfig</code>:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/0ed41c3f1036785c6c86dd35d20412c8387cf382/pkg/kubelet/config/config.go#L56</span>
<span class="k">type</span> <span class="n">PodConfig</span> <span class="k">struct</span> <span class="p">{</span>
	<span class="n">pods</span> <span class="o">*</span><span class="n">podStorage</span>
	<span class="n">mux</span>  <span class="o">*</span><span class="n">config</span><span class="o">.</span><span class="n">Mux</span>

	<span class="c">// the channel of denormalized changes passed to listeners</span>
	<span class="n">updates</span> <span class="k">chan</span> <span class="n">kubetypes</span><span class="o">.</span><span class="n">PodUpdate</span>

	<span class="c">// contains the list of all configured sources</span>
	<span class="n">sourcesLock</span> <span class="n">sync</span><span class="o">.</span><span class="n">Mutex</span>
	<span class="n">sources</span>     <span class="n">sets</span><span class="o">.</span><span class="n">String</span>
<span class="p">}</span>
</code></pre></div></div>
<p>It contains the storage of known pods (<code class="language-plaintext highlighter-rouge">pods</code>) and a channel <code class="language-plaintext highlighter-rouge">updates</code> through which pod updates are delivered to listeners.</p>

<p>A pod update is represented as follows:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/0ed41c3f1036785c6c86dd35d20412c8387cf382/pkg/kubelet/types/pod_update.go#L74</span>
<span class="k">type</span> <span class="n">PodUpdate</span> <span class="k">struct</span> <span class="p">{</span>
	<span class="n">Pods</span>   <span class="p">[]</span><span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Pod</span>
	<span class="n">Op</span>     <span class="n">PodOperation</span>
	<span class="n">Source</span> <span class="kt">string</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Whenever a Pod is updated, the update is sent through the <code class="language-plaintext highlighter-rouge">updates</code> channel and received by a handler, which applies it on the node.</p>
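<p>The producer and consumer relationship around the <code class="language-plaintext highlighter-rouge">updates</code> channel can be sketched as below. The types are deliberately simplified and partly hypothetical: pods are reduced to names, only three operations are modelled, and <code class="language-plaintext highlighter-rouge">recordingHandler</code> merely logs what a real handler would execute.</p>

```go
package main

import "fmt"

// PodOperation is a simplified version of the operation enum; the real type
// defines more values.
type PodOperation int

const (
	ADD PodOperation = iota
	UPDATE
	REMOVE
)

// PodUpdate mirrors the struct shown above, with pods reduced to their names.
type PodUpdate struct {
	Pods   []string
	Op     PodOperation
	Source string
}

// recordingHandler is a hypothetical handler that records each dispatched update.
type recordingHandler struct {
	log []string
}

// opName maps an operation to a readable label.
func opName(op PodOperation) string {
	switch op {
	case ADD:
		return "add"
	case UPDATE:
		return "update"
	case REMOVE:
		return "remove"
	}
	return "unknown"
}

// handle stands in for the per-operation callbacks of a real handler.
func (h *recordingHandler) handle(u PodUpdate) {
	h.log = append(h.log, fmt.Sprintf("%s %v from %s", opName(u.Op), u.Pods, u.Source))
}

// consume drains the channel until it is closed, dispatching every update.
func consume(updates <-chan PodUpdate, h *recordingHandler) {
	for u := range updates {
		h.handle(u)
	}
}

// runDemo sends two updates and returns what the handler recorded.
func runDemo() []string {
	updates := make(chan PodUpdate, 2)
	updates <- PodUpdate{Pods: []string{"web"}, Op: ADD, Source: "api"}
	updates <- PodUpdate{Pods: []string{"web"}, Op: REMOVE, Source: "api"}
	close(updates)

	h := &recordingHandler{}
	consume(updates, h)
	return h.log
}

func main() {
	for _, line := range runDemo() {
		fmt.Println(line)
	}
}
```

<p>Because the consumer sees only the channel, config sources can be added or swapped without touching the handling code.</p>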

<h3 id="podconfig-synchronizer">PodConfig synchronizer</h3>
<p>Among all the handling logic, <code class="language-plaintext highlighter-rouge">syncLoop</code> is the main loop that processes Pod updates. It drives the running state toward the desired state.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/8c724d793370605d0c474eb6e4fb74779212ff1d/pkg/kubelet/kubelet.go#L1785</span>
<span class="k">func</span> <span class="p">(</span><span class="n">kl</span> <span class="o">*</span><span class="n">Kubelet</span><span class="p">)</span> <span class="n">syncLoop</span><span class="p">(</span><span class="n">updates</span> <span class="o">&lt;-</span><span class="k">chan</span> <span class="n">kubetypes</span><span class="o">.</span><span class="n">PodUpdate</span><span class="p">,</span> <span class="n">handler</span> <span class="n">SyncHandler</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">klog</span><span class="o">.</span><span class="n">Info</span><span class="p">(</span><span class="s">"Starting kubelet main sync loop."</span><span class="p">)</span>
	<span class="c">// The syncTicker wakes up kubelet to checks if there are any pod workers</span>
	<span class="c">// that need to be sync'd. A one-second period is sufficient because the</span>
	<span class="c">// sync interval is defaulted to 10s.</span>
	<span class="n">syncTicker</span> <span class="o">:=</span> <span class="n">time</span><span class="o">.</span><span class="n">NewTicker</span><span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">Second</span><span class="p">)</span>
	<span class="k">defer</span> <span class="n">syncTicker</span><span class="o">.</span><span class="n">Stop</span><span class="p">()</span>
	<span class="n">housekeepingTicker</span> <span class="o">:=</span> <span class="n">time</span><span class="o">.</span><span class="n">NewTicker</span><span class="p">(</span><span class="n">housekeepingPeriod</span><span class="p">)</span>
	<span class="k">defer</span> <span class="n">housekeepingTicker</span><span class="o">.</span><span class="n">Stop</span><span class="p">()</span>
	<span class="n">plegCh</span> <span class="o">:=</span> <span class="n">kl</span><span class="o">.</span><span class="n">pleg</span><span class="o">.</span><span class="n">Watch</span><span class="p">()</span>
	<span class="k">const</span> <span class="p">(</span>
		<span class="n">base</span>   <span class="o">=</span> <span class="m">100</span> <span class="o">*</span> <span class="n">time</span><span class="o">.</span><span class="n">Millisecond</span>
		<span class="n">max</span>    <span class="o">=</span> <span class="m">5</span> <span class="o">*</span> <span class="n">time</span><span class="o">.</span><span class="n">Second</span>
		<span class="n">factor</span> <span class="o">=</span> <span class="m">2</span>
	<span class="p">)</span>
	<span class="n">duration</span> <span class="o">:=</span> <span class="n">base</span>
	<span class="k">for</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">kl</span><span class="o">.</span><span class="n">runtimeState</span><span class="o">.</span><span class="n">runtimeErrors</span><span class="p">();</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="n">klog</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"skipping pod synchronization - %v"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
			<span class="c">// exponential backoff</span>
			<span class="n">time</span><span class="o">.</span><span class="n">Sleep</span><span class="p">(</span><span class="n">duration</span><span class="p">)</span>
			<span class="n">duration</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">Duration</span><span class="p">(</span><span class="n">math</span><span class="o">.</span><span class="n">Min</span><span class="p">(</span><span class="kt">float64</span><span class="p">(</span><span class="n">max</span><span class="p">),</span> <span class="n">factor</span><span class="o">*</span><span class="kt">float64</span><span class="p">(</span><span class="n">duration</span><span class="p">)))</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="c">// reset backoff if we have a success</span>
		<span class="n">duration</span> <span class="o">=</span> <span class="n">base</span>

		<span class="n">kl</span><span class="o">.</span><span class="n">syncLoopMonitor</span><span class="o">.</span><span class="n">Store</span><span class="p">(</span><span class="n">kl</span><span class="o">.</span><span class="n">clock</span><span class="o">.</span><span class="n">Now</span><span class="p">())</span>
		<span class="k">if</span> <span class="o">!</span><span class="n">kl</span><span class="o">.</span><span class="n">syncLoopIteration</span><span class="p">(</span><span class="n">updates</span><span class="p">,</span> <span class="n">handler</span><span class="p">,</span> <span class="n">syncTicker</span><span class="o">.</span><span class="n">C</span><span class="p">,</span> <span class="n">housekeepingTicker</span><span class="o">.</span><span class="n">C</span><span class="p">,</span> <span class="n">plegCh</span><span class="p">)</span> <span class="p">{</span>
			<span class="k">break</span>
		<span class="p">}</span>
		<span class="n">kl</span><span class="o">.</span><span class="n">syncLoopMonitor</span><span class="o">.</span><span class="n">Store</span><span class="p">(</span><span class="n">kl</span><span class="o">.</span><span class="n">clock</span><span class="o">.</span><span class="n">Now</span><span class="p">())</span>
	<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
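<p>Essentially it is an endless loop that listens for config updates. It is also woken up periodically to sync to the last known desired state. If the container runtime reports errors, the loop skips synchronization and retries with exponential backoff, starting at 100ms and doubling up to a cap of 5s.</p>

<p>Based on the type of Pod update, the event is dispatched to the appropriate handler:</p>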
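<p>The backoff arithmetic used in the loop above can be isolated into a small sketch. The constants are the ones from the listing (with <code class="language-plaintext highlighter-rouge">max</code> renamed to <code class="language-plaintext highlighter-rouge">maxDelay</code>), while <code class="language-plaintext highlighter-rouge">nextBackoff</code> and <code class="language-plaintext highlighter-rouge">backoffSchedule</code> are hypothetical helper names introduced here for illustration.</p>

```go
package main

import (
	"fmt"
	"math"
	"time"
)

const (
	base     = 100 * time.Millisecond
	maxDelay = 5 * time.Second
	factor   = 2
)

// nextBackoff doubles the current delay and caps it at maxDelay, using the
// same expression as the loop above.
func nextBackoff(d time.Duration) time.Duration {
	return time.Duration(math.Min(float64(maxDelay), factor*float64(d)))
}

// backoffSchedule returns the delays slept for n consecutive runtime errors,
// starting from base.
func backoffSchedule(n int) []time.Duration {
	delays := make([]time.Duration, 0, n)
	d := base
	for i := 0; i < n; i++ {
		delays = append(delays, d)
		d = nextBackoff(d)
	}
	return delays
}

func main() {
	// Prints: [100ms 200ms 400ms 800ms 1.6s 3.2s 5s 5s]
	fmt.Println(backoffSchedule(8))
}
```

<p>After six consecutive failures the delay reaches the cap, so a runtime that stays broken is polled once every five seconds. A single successful check resets the delay to <code class="language-plaintext highlighter-rouge">base</code>.</p>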
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/8c724d793370605d0c474eb6e4fb74779212ff1d/pkg/kubelet/kubelet.go#L1859</span>
<span class="k">func</span> <span class="p">(</span><span class="n">kl</span> <span class="o">*</span><span class="n">Kubelet</span><span class="p">)</span> <span class="n">syncLoopIteration</span><span class="p">(</span><span class="n">configCh</span> <span class="o">&lt;-</span><span class="k">chan</span> <span class="n">kubetypes</span><span class="o">.</span><span class="n">PodUpdate</span><span class="p">,</span> <span class="n">handler</span> <span class="n">SyncHandler</span><span class="p">,</span>
	<span class="n">syncCh</span> <span class="o">&lt;-</span><span class="k">chan</span> <span class="n">time</span><span class="o">.</span><span class="n">Time</span><span class="p">,</span> <span class="n">housekeepingCh</span> <span class="o">&lt;-</span><span class="k">chan</span> <span class="n">time</span><span class="o">.</span><span class="n">Time</span><span class="p">,</span> <span class="n">plegCh</span> <span class="o">&lt;-</span><span class="k">chan</span> <span class="o">*</span><span class="n">pleg</span><span class="o">.</span><span class="n">PodLifecycleEvent</span><span class="p">)</span> <span class="kt">bool</span> <span class="p">{</span>
	<span class="k">select</span> <span class="p">{</span>
	<span class="k">case</span> <span class="n">u</span><span class="p">,</span> <span class="n">open</span> <span class="o">:=</span> <span class="o">&lt;-</span><span class="n">configCh</span><span class="o">:</span>
		<span class="c">// Update from a config source; dispatch it to the right handler</span>
		<span class="c">// callback.</span>
		<span class="k">if</span> <span class="o">!</span><span class="n">open</span> <span class="p">{</span>
			<span class="n">klog</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"Update channel is closed. Exiting the sync loop."</span><span class="p">)</span>
			<span class="k">return</span> <span class="no">false</span>
		<span class="p">}</span>

		<span class="k">switch</span> <span class="n">u</span><span class="o">.</span><span class="n">Op</span> <span class="p">{</span>
		<span class="k">case</span> <span class="n">kubetypes</span><span class="o">.</span><span class="n">ADD</span><span class="o">:</span>
			<span class="n">klog</span><span class="o">.</span><span class="n">V</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="o">.</span><span class="n">Infof</span><span class="p">(</span><span class="s">"SyncLoop (ADD, %q): %q"</span><span class="p">,</span> <span class="n">u</span><span class="o">.</span><span class="n">Source</span><span class="p">,</span> <span class="n">format</span><span class="o">.</span><span class="n">Pods</span><span class="p">(</span><span class="n">u</span><span class="o">.</span><span class="n">Pods</span><span class="p">))</span>
			<span class="c">// After restarting, kubelet will get all existing pods through</span>
			<span class="c">// ADD as if they are new pods. These pods will then go through the</span>
			<span class="c">// admission process and *may* be rejected. This can be resolved</span>
			<span class="c">// once we have checkpointing.</span>
			<span class="n">handler</span><span class="o">.</span><span class="n">HandlePodAdditions</span><span class="p">(</span><span class="n">u</span><span class="o">.</span><span class="n">Pods</span><span class="p">)</span>
		<span class="k">case</span> <span class="n">kubetypes</span><span class="o">.</span><span class="n">UPDATE</span><span class="o">:</span>
			<span class="n">klog</span><span class="o">.</span><span class="n">V</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="o">.</span><span class="n">Infof</span><span class="p">(</span><span class="s">"SyncLoop (UPDATE, %q): %q"</span><span class="p">,</span> <span class="n">u</span><span class="o">.</span><span class="n">Source</span><span class="p">,</span> <span class="n">format</span><span class="o">.</span><span class="n">PodsWithDeletionTimestamps</span><span class="p">(</span><span class="n">u</span><span class="o">.</span><span class="n">Pods</span><span class="p">))</span>
			<span class="n">handler</span><span class="o">.</span><span class="n">HandlePodUpdates</span><span class="p">(</span><span class="n">u</span><span class="o">.</span><span class="n">Pods</span><span class="p">)</span>
		<span class="k">case</span> <span class="n">kubetypes</span><span class="o">.</span><span class="n">REMOVE</span><span class="o">:</span>
			<span class="n">klog</span><span class="o">.</span><span class="n">V</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="o">.</span><span class="n">Infof</span><span class="p">(</span><span class="s">"SyncLoop (REMOVE, %q): %q"</span><span class="p">,</span> <span class="n">u</span><span class="o">.</span><span class="n">Source</span><span class="p">,</span> <span class="n">format</span><span class="o">.</span><span class="n">Pods</span><span class="p">(</span><span class="n">u</span><span class="o">.</span><span class="n">Pods</span><span class="p">))</span>
			<span class="n">handler</span><span class="o">.</span><span class="n">HandlePodRemoves</span><span class="p">(</span><span class="n">u</span><span class="o">.</span><span class="n">Pods</span><span class="p">)</span>
		<span class="k">case</span> <span class="n">kubetypes</span><span class="o">.</span><span class="n">RECONCILE</span><span class="o">:</span>
			<span class="n">klog</span><span class="o">.</span><span class="n">V</span><span class="p">(</span><span class="m">4</span><span class="p">)</span><span class="o">.</span><span class="n">Infof</span><span class="p">(</span><span class="s">"SyncLoop (RECONCILE, %q): %q"</span><span class="p">,</span> <span class="n">u</span><span class="o">.</span><span class="n">Source</span><span class="p">,</span> <span class="n">format</span><span class="o">.</span><span class="n">Pods</span><span class="p">(</span><span class="n">u</span><span class="o">.</span><span class="n">Pods</span><span class="p">))</span>
			<span class="n">handler</span><span class="o">.</span><span class="n">HandlePodReconcile</span><span class="p">(</span><span class="n">u</span><span class="o">.</span><span class="n">Pods</span><span class="p">)</span>
		<span class="k">case</span> <span class="n">kubetypes</span><span class="o">.</span><span class="n">DELETE</span><span class="o">:</span>
			<span class="n">klog</span><span class="o">.</span><span class="n">V</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="o">.</span><span class="n">Infof</span><span class="p">(</span><span class="s">"SyncLoop (DELETE, %q): %q"</span><span class="p">,</span> <span class="n">u</span><span class="o">.</span><span class="n">Source</span><span class="p">,</span> <span class="n">format</span><span class="o">.</span><span class="n">Pods</span><span class="p">(</span><span class="n">u</span><span class="o">.</span><span class="n">Pods</span><span class="p">))</span>
			<span class="c">// DELETE is treated as a UPDATE because of graceful deletion.</span>
			<span class="n">handler</span><span class="o">.</span><span class="n">HandlePodUpdates</span><span class="p">(</span><span class="n">u</span><span class="o">.</span><span class="n">Pods</span><span class="p">)</span>
		<span class="k">case</span> <span class="n">kubetypes</span><span class="o">.</span><span class="n">SET</span><span class="o">:</span>
			<span class="c">// TODO: Do we want to support this?</span>
			<span class="n">klog</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"Kubelet does not support snapshot update"</span><span class="p">)</span>
		<span class="k">default</span><span class="o">:</span>
			<span class="n">klog</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"Invalid event type received: %d."</span><span class="p">,</span> <span class="n">u</span><span class="o">.</span><span class="n">Op</span><span class="p">)</span>
		<span class="p">}</span>
	<span class="o">...</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="no">true</span>
<span class="p">}</span>
</code></pre></div></div>
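<p>The dispatch pattern in the excerpt above can be reduced to a small, self-contained sketch. The types below (<code class="language-plaintext highlighter-rouge">Op</code>, <code class="language-plaintext highlighter-rouge">PodUpdate</code>, <code class="language-plaintext highlighter-rouge">recorder</code>) are simplified stand-ins invented for illustration, not the real kubelet types, and only the config channel is modeled:</p>

```go
package main

import "fmt"

// Op is a hypothetical stand-in for the kubelet's pod operation type.
type Op int

const (
	Add Op = iota
	Update
	Remove
	Delete
)

// PodUpdate is a simplified stand-in for kubetypes.PodUpdate.
type PodUpdate struct {
	Op   Op
	Pods []string
}

// recorder is a toy handler that only records which callback ran.
type recorder struct{ calls []string }

func (r *recorder) HandlePodAdditions(pods []string) { r.calls = append(r.calls, "add") }
func (r *recorder) HandlePodUpdates(pods []string)   { r.calls = append(r.calls, "update") }
func (r *recorder) HandlePodRemoves(pods []string)   { r.calls = append(r.calls, "remove") }

// iterate handles one event and reports whether the loop should keep running.
// The real loop selects over several channels; this sketch keeps only one case.
func iterate(configCh <-chan PodUpdate, h *recorder) bool {
	select {
	case u, open := <-configCh:
		if !open {
			// A closed channel ends the loop, as in the excerpt above.
			return false
		}
		switch u.Op {
		case Add:
			h.HandlePodAdditions(u.Pods)
		case Update, Delete:
			// DELETE goes through the update path (graceful deletion).
			h.HandlePodUpdates(u.Pods)
		case Remove:
			h.HandlePodRemoves(u.Pods)
		default:
			fmt.Printf("invalid event type: %d\n", u.Op)
		}
	}
	return true
}

// runDemo feeds three events through the loop and returns the recorded calls.
func runDemo() []string {
	ch := make(chan PodUpdate, 3)
	ch <- PodUpdate{Op: Add, Pods: []string{"web"}}
	ch <- PodUpdate{Op: Delete, Pods: []string{"web"}}
	ch <- PodUpdate{Op: Remove, Pods: []string{"web"}}
	close(ch)

	h := &recorder{}
	for iterate(ch, h) {
	}
	return h.calls
}

func main() {
	fmt.Println(runDemo())
}
```

<p>Returning a <code class="language-plaintext highlighter-rouge">bool</code> from each iteration keeps the exit condition (a closed config channel) in one place, so the outer loop stays a plain <code class="language-plaintext highlighter-rouge">for</code>.</p>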
<h3 id="pod-update-handler">Pod Update Handler</h3>
<p>According to the above code, a pod-ADD operation is handled by <code class="language-plaintext highlighter-rouge">HandlePodAdditions</code>:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/8c724d793370605d0c474eb6e4fb74779212ff1d/pkg/kubelet/kubelet.go#L2010</span>
<span class="k">func</span> <span class="p">(</span><span class="n">kl</span> <span class="o">*</span><span class="n">Kubelet</span><span class="p">)</span> <span class="n">HandlePodAdditions</span><span class="p">(</span><span class="n">pods</span> <span class="p">[]</span><span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Pod</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">start</span> <span class="o">:=</span> <span class="n">kl</span><span class="o">.</span><span class="n">clock</span><span class="o">.</span><span class="n">Now</span><span class="p">()</span>
	<span class="n">sort</span><span class="o">.</span><span class="n">Sort</span><span class="p">(</span><span class="n">sliceutils</span><span class="o">.</span><span class="n">PodsByCreationTime</span><span class="p">(</span><span class="n">pods</span><span class="p">))</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">pod</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">pods</span> <span class="p">{</span>
		<span class="n">existingPods</span> <span class="o">:=</span> <span class="n">kl</span><span class="o">.</span><span class="n">podManager</span><span class="o">.</span><span class="n">GetPods</span><span class="p">()</span>
		<span class="c">// Always add the pod to the pod manager. Kubelet relies on the pod</span>
		<span class="c">// manager as the source of truth for the desired state. If a pod does</span>
		<span class="c">// not exist in the pod manager, it means that it has been deleted in</span>
		<span class="c">// the apiserver and no action (other than cleanup) is required.</span>
		<span class="n">kl</span><span class="o">.</span><span class="n">podManager</span><span class="o">.</span><span class="n">AddPod</span><span class="p">(</span><span class="n">pod</span><span class="p">)</span>
		<span class="o">...</span>
		<span class="n">kl</span><span class="o">.</span><span class="n">dispatchWork</span><span class="p">(</span><span class="n">pod</span><span class="p">,</span> <span class="n">kubetypes</span><span class="o">.</span><span class="n">SyncPodCreate</span><span class="p">,</span> <span class="n">mirrorPod</span><span class="p">,</span> <span class="n">start</span><span class="p">)</span>
		<span class="n">kl</span><span class="o">.</span><span class="n">probeManager</span><span class="o">.</span><span class="n">AddPod</span><span class="p">(</span><span class="n">pod</span><span class="p">)</span>
	<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The pod is first added to <code class="language-plaintext highlighter-rouge">podManager</code>, which records the desired state. The actual pod creation work is then dispatched to a <code class="language-plaintext highlighter-rouge">pod worker</code> and handled asynchronously.</p>

<p>The <code class="language-plaintext highlighter-rouge">podWorkers</code> struct has an <code class="language-plaintext highlighter-rouge">UpdatePod</code> method, which starts a goroutine to handle pod-update work.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/442a69c3bdf6fe8e525b05887e57d89db1e2f3a5/pkg/kubelet/pod_workers.go#L220</span>
<span class="k">go</span> <span class="k">func</span><span class="p">()</span> <span class="p">{</span>
	<span class="k">defer</span> <span class="n">runtime</span><span class="o">.</span><span class="n">HandleCrash</span><span class="p">()</span>
	<span class="n">p</span><span class="o">.</span><span class="n">managePodLoop</span><span class="p">(</span><span class="n">podUpdates</span><span class="p">)</span>
<span class="p">}()</span>
</code></pre></div></div>
<p>The actual pod operation is performed by a <code class="language-plaintext highlighter-rouge">syncPodFn</code> function, which is configured when kubelet starts.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source:https://github.com/kubernetes/kubernetes/blob/442a69c3bdf6fe8e525b05887e57d89db1e2f3a5/pkg/kubelet/pod_workers.go#L175</span>
<span class="n">err</span> <span class="o">=</span> <span class="n">p</span><span class="o">.</span><span class="n">syncPodFn</span><span class="p">(</span><span class="n">syncPodOptions</span><span class="p">{</span>
	<span class="n">mirrorPod</span><span class="o">:</span>      <span class="n">update</span><span class="o">.</span><span class="n">MirrorPod</span><span class="p">,</span>
	<span class="n">pod</span><span class="o">:</span>            <span class="n">update</span><span class="o">.</span><span class="n">Pod</span><span class="p">,</span>
	<span class="n">podStatus</span><span class="o">:</span>      <span class="n">status</span><span class="p">,</span>
	<span class="n">killPodOptions</span><span class="o">:</span> <span class="n">update</span><span class="o">.</span><span class="n">KillPodOptions</span><span class="p">,</span>
	<span class="n">updateType</span><span class="o">:</span>     <span class="n">update</span><span class="o">.</span><span class="n">UpdateType</span><span class="p">,</span>
<span class="p">})</span>
</code></pre></div></div>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/8c724d793370605d0c474eb6e4fb74779212ff1d/pkg/kubelet/kubelet.go#L1438</span>
<span class="k">func</span> <span class="p">(</span><span class="n">kl</span> <span class="o">*</span><span class="n">Kubelet</span><span class="p">)</span> <span class="n">syncPod</span><span class="p">(</span><span class="n">o</span> <span class="n">syncPodOptions</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="o">...</span>
	<span class="n">result</span> <span class="o">:=</span> <span class="n">kl</span><span class="o">.</span><span class="n">containerRuntime</span><span class="o">.</span><span class="n">SyncPod</span><span class="p">(</span><span class="n">pod</span><span class="p">,</span> <span class="n">podStatus</span><span class="p">,</span> <span class="n">pullSecrets</span><span class="p">,</span> <span class="n">kl</span><span class="o">.</span><span class="n">backOff</span><span class="p">)</span>
	<span class="o">...</span>
<span class="p">}</span>
</code></pre></div></div>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/e6c67c32e140f88c923499b2a35fb96b34fdfdd2/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L661</span>
<span class="k">func</span> <span class="p">(</span><span class="n">m</span> <span class="o">*</span><span class="n">kubeGenericRuntimeManager</span><span class="p">)</span> <span class="n">SyncPod</span><span class="p">(</span><span class="n">pod</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Pod</span><span class="p">,</span> <span class="n">podStatus</span> <span class="o">*</span><span class="n">kubecontainer</span><span class="o">.</span><span class="n">PodStatus</span><span class="p">,</span> <span class="n">pullSecrets</span> <span class="p">[]</span><span class="n">v1</span><span class="o">.</span><span class="n">Secret</span><span class="p">,</span> <span class="n">backOff</span> <span class="o">*</span><span class="n">flowcontrol</span><span class="o">.</span><span class="n">Backoff</span><span class="p">)</span> <span class="p">(</span><span class="n">result</span> <span class="n">kubecontainer</span><span class="o">.</span><span class="n">PodSyncResult</span><span class="p">)</span> <span class="p">{</span>
	<span class="o">...</span>
<span class="p">}</span>
</code></pre></div></div>
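<p>The worker arrangement described above (an update channel and a goroutine per pod, with the sync function injected at construction time) can be sketched as follows. This is a minimal model under stated assumptions: the <code class="language-plaintext highlighter-rouge">workers</code> type, the <code class="language-plaintext highlighter-rouge">syncFn</code> signature and the buffer size are hypothetical, and the real <code class="language-plaintext highlighter-rouge">podWorkers</code> carries additional bookkeeping that is not modeled here.</p>

```go
package main

import (
	"fmt"
	"sync"
)

// syncFn stands in for the injected syncPodFn; this signature is hypothetical.
type syncFn func(podUID, updateType string) error

// workers keeps one update channel (and one goroutine) per pod UID.
type workers struct {
	mu      sync.Mutex
	updates map[string]chan string
	syncPod syncFn
	wg      sync.WaitGroup
}

func newWorkers(fn syncFn) *workers {
	return &workers{updates: make(map[string]chan string), syncPod: fn}
}

// UpdatePod queues an update and starts the pod's goroutine on first use.
func (w *workers) UpdatePod(uid, updateType string) {
	w.mu.Lock()
	ch, ok := w.updates[uid]
	if !ok {
		ch = make(chan string, 8)
		w.updates[uid] = ch
		w.wg.Add(1)
		go func() {
			defer w.wg.Done()
			// Updates for a single pod are processed one at a time, in order.
			for ut := range ch {
				if err := w.syncPod(uid, ut); err != nil {
					fmt.Println("sync failed:", err)
				}
			}
		}()
	}
	w.mu.Unlock()
	ch <- updateType
}

// Close stops all workers after their queued updates are processed.
// It must not be called concurrently with UpdatePod.
func (w *workers) Close() {
	w.mu.Lock()
	for _, ch := range w.updates {
		close(ch)
	}
	w.mu.Unlock()
	w.wg.Wait()
}

// runDemo sends updates for two pods and returns what each worker saw.
func runDemo() map[string][]string {
	var mu sync.Mutex
	seen := map[string][]string{}
	w := newWorkers(func(uid, ut string) error {
		mu.Lock()
		defer mu.Unlock()
		seen[uid] = append(seen[uid], ut)
		return nil
	})
	w.UpdatePod("pod-a", "create")
	w.UpdatePod("pod-a", "update")
	w.UpdatePod("pod-b", "create")
	w.Close()
	return seen
}

func main() {
	seen := runDemo()
	fmt.Println(seen["pod-a"], seen["pod-b"])
}
```

<p>Passing the sync function in as a field, rather than calling it directly, is what lets the worker code stay independent of the kubelet type that eventually implements <code class="language-plaintext highlighter-rouge">syncPod</code>.</p>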
<p>The pod operation is backed by the container runtime. The function that starts a container is as follows:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/6d001ebb68efd8a499c07b37b9b59158ca6159c8/pkg/kubelet/kuberuntime/kuberuntime_container.go#L134</span>
<span class="k">func</span> <span class="p">(</span><span class="n">m</span> <span class="o">*</span><span class="n">kubeGenericRuntimeManager</span><span class="p">)</span> <span class="n">startContainer</span><span class="p">(</span><span class="n">podSandboxID</span> <span class="kt">string</span><span class="p">,</span> <span class="n">podSandboxConfig</span> <span class="o">*</span><span class="n">runtimeapi</span><span class="o">.</span><span class="n">PodSandboxConfig</span><span class="p">,</span> <span class="n">spec</span> <span class="o">*</span><span class="n">startSpec</span><span class="p">,</span> <span class="n">pod</span> <span class="o">*</span><span class="n">v1</span><span class="o">.</span><span class="n">Pod</span><span class="p">,</span> <span class="n">podStatus</span> <span class="o">*</span><span class="n">kubecontainer</span><span class="o">.</span><span class="n">PodStatus</span><span class="p">,</span> <span class="n">pullSecrets</span> <span class="p">[]</span><span class="n">v1</span><span class="o">.</span><span class="n">Secret</span><span class="p">,</span> <span class="n">podIP</span> <span class="kt">string</span><span class="p">,</span> <span class="n">podIPs</span> <span class="p">[]</span><span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="kt">string</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">container</span> <span class="o">:=</span> <span class="n">spec</span><span class="o">.</span><span class="n">container</span>

	<span class="c">// Step 1: pull the image.</span>
	<span class="n">imageRef</span><span class="p">,</span> <span class="n">msg</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">m</span><span class="o">.</span><span class="n">imagePuller</span><span class="o">.</span><span class="n">EnsureImageExists</span><span class="p">(</span><span class="n">pod</span><span class="p">,</span> <span class="n">container</span><span class="p">,</span> <span class="n">pullSecrets</span><span class="p">,</span> <span class="n">podSandboxConfig</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="n">s</span><span class="p">,</span> <span class="n">_</span> <span class="o">:=</span> <span class="n">grpcstatus</span><span class="o">.</span><span class="n">FromError</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
		<span class="n">m</span><span class="o">.</span><span class="n">recordContainerEvent</span><span class="p">(</span><span class="n">pod</span><span class="p">,</span> <span class="n">container</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="n">v1</span><span class="o">.</span><span class="n">EventTypeWarning</span><span class="p">,</span> <span class="n">events</span><span class="o">.</span><span class="n">FailedToCreateContainer</span><span class="p">,</span> <span class="s">"Error: %v"</span><span class="p">,</span> <span class="n">s</span><span class="o">.</span><span class="n">Message</span><span class="p">())</span>
		<span class="k">return</span> <span class="n">msg</span><span class="p">,</span> <span class="n">err</span>
	<span class="p">}</span>

	<span class="c">// Step 2: create the container.</span>
	<span class="c">// For a new container, the RestartCount should be 0</span>
	<span class="o">...</span>
	<span class="n">containerID</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">m</span><span class="o">.</span><span class="n">runtimeService</span><span class="o">.</span><span class="n">CreateContainer</span><span class="p">(</span><span class="n">podSandboxID</span><span class="p">,</span> <span class="n">containerConfig</span><span class="p">,</span> <span class="n">podSandboxConfig</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="n">s</span><span class="p">,</span> <span class="n">_</span> <span class="o">:=</span> <span class="n">grpcstatus</span><span class="o">.</span><span class="n">FromError</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
		<span class="n">m</span><span class="o">.</span><span class="n">recordContainerEvent</span><span class="p">(</span><span class="n">pod</span><span class="p">,</span> <span class="n">container</span><span class="p">,</span> <span class="n">containerID</span><span class="p">,</span> <span class="n">v1</span><span class="o">.</span><span class="n">EventTypeWarning</span><span class="p">,</span> <span class="n">events</span><span class="o">.</span><span class="n">FailedToCreateContainer</span><span class="p">,</span> <span class="s">"Error: %v"</span><span class="p">,</span> <span class="n">s</span><span class="o">.</span><span class="n">Message</span><span class="p">())</span>
		<span class="k">return</span> <span class="n">s</span><span class="o">.</span><span class="n">Message</span><span class="p">(),</span> <span class="n">ErrCreateContainer</span>
	<span class="p">}</span>
	<span class="n">err</span> <span class="o">=</span> <span class="n">m</span><span class="o">.</span><span class="n">internalLifecycle</span><span class="o">.</span><span class="n">PreStartContainer</span><span class="p">(</span><span class="n">pod</span><span class="p">,</span> <span class="n">container</span><span class="p">,</span> <span class="n">containerID</span><span class="p">)</span>

	<span class="c">// Step 3: start the container.</span>
	<span class="n">err</span> <span class="o">=</span> <span class="n">m</span><span class="o">.</span><span class="n">runtimeService</span><span class="o">.</span><span class="n">StartContainer</span><span class="p">(</span><span class="n">containerID</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="n">s</span><span class="p">,</span> <span class="n">_</span> <span class="o">:=</span> <span class="n">grpcstatus</span><span class="o">.</span><span class="n">FromError</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
		<span class="n">m</span><span class="o">.</span><span class="n">recordContainerEvent</span><span class="p">(</span><span class="n">pod</span><span class="p">,</span> <span class="n">container</span><span class="p">,</span> <span class="n">containerID</span><span class="p">,</span> <span class="n">v1</span><span class="o">.</span><span class="n">EventTypeWarning</span><span class="p">,</span> <span class="n">events</span><span class="o">.</span><span class="n">FailedToStartContainer</span><span class="p">,</span> <span class="s">"Error: %v"</span><span class="p">,</span> <span class="n">s</span><span class="o">.</span><span class="n">Message</span><span class="p">())</span>
		<span class="k">return</span> <span class="n">s</span><span class="o">.</span><span class="n">Message</span><span class="p">(),</span> <span class="n">kubecontainer</span><span class="o">.</span><span class="n">ErrRunContainer</span>
	<span class="p">}</span>
	<span class="n">m</span><span class="o">.</span><span class="n">recordContainerEvent</span><span class="p">(</span><span class="n">pod</span><span class="p">,</span> <span class="n">container</span><span class="p">,</span> <span class="n">containerID</span><span class="p">,</span> <span class="n">v1</span><span class="o">.</span><span class="n">EventTypeNormal</span><span class="p">,</span> <span class="n">events</span><span class="o">.</span><span class="n">StartedContainer</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Sprintf</span><span class="p">(</span><span class="s">"Started container %s"</span><span class="p">,</span> <span class="n">container</span><span class="o">.</span><span class="n">Name</span><span class="p">))</span>

	<span class="c">// Step 4: execute the post start hook.</span>
	<span class="o">...</span>
<span class="p">}</span>
</code></pre></div></div>
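<p>The pull, create and start sequence in <code class="language-plaintext highlighter-rouge">startContainer</code> can be modeled against a small interface. The sketch below is illustrative only: <code class="language-plaintext highlighter-rouge">runtimeService</code> here is a hypothetical three-method interface and <code class="language-plaintext highlighter-rouge">fakeRuntime</code> is an in-memory fake, neither of which matches the real CRI client signatures.</p>

```go
package main

import (
	"errors"
	"fmt"
)

// runtimeService is a hypothetical, trimmed-down stand-in for the runtime client.
type runtimeService interface {
	PullImage(image string) error
	CreateContainer(sandboxID, image string) (string, error)
	StartContainer(containerID string) error
}

// fakeRuntime records calls in memory instead of talking to a real runtime.
type fakeRuntime struct {
	calls     []string
	failStart bool
}

func (f *fakeRuntime) PullImage(image string) error {
	f.calls = append(f.calls, "pull")
	return nil
}

func (f *fakeRuntime) CreateContainer(sandboxID, image string) (string, error) {
	f.calls = append(f.calls, "create")
	return sandboxID + "/c0", nil
}

func (f *fakeRuntime) StartContainer(containerID string) error {
	f.calls = append(f.calls, "start")
	if f.failStart {
		return errors.New("start refused")
	}
	return nil
}

// startContainer runs the steps in order and stops at the first failure.
func startContainer(rt runtimeService, sandboxID, image string) (string, error) {
	// Step 1: make sure the image is present.
	if err := rt.PullImage(image); err != nil {
		return "", fmt.Errorf("pull image: %w", err)
	}
	// Step 2: create the container inside the pod sandbox.
	id, err := rt.CreateContainer(sandboxID, image)
	if err != nil {
		return "", fmt.Errorf("create container: %w", err)
	}
	// Step 3: start it.
	if err := rt.StartContainer(id); err != nil {
		return "", fmt.Errorf("start container: %w", err)
	}
	return id, nil
}

func main() {
	rt := &fakeRuntime{}
	id, err := startContainer(rt, "sandbox-1", "nginx")
	fmt.Println(id, err, rt.calls)
}
```

<p>Because the steps only depend on the interface, the same sequence can be exercised against a fake in tests and against a remote runtime in production.</p>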
<p>This is the lowest point of the pod creation logic in kubelet. Kubelet uses a <code class="language-plaintext highlighter-rouge">grpc</code> client to communicate with the container runtime (e.g. Docker containerd) and instructs it to perform the actual container operation.</p>]]></content><author><name></name></author><category term="cloud" /><category term="kubernetes" /><category term="golang" /><category term="container" /><summary type="html"><![CDATA[Analysis of kubelet project.]]></summary></entry><entry><title type="html">Kubernetes Project Exploration, Part 2 - kubectl/kube-apiserver mechanism analysis and source code walk-through</title><link href="https://blog.labxq.com/cloud/2020/10/09/kubernetes-exploration-part2-kubectl-apiserver.html" rel="alternate" type="text/html" title="Kubernetes Project Exploration, Part 2 - kubectl/kube-apiserver mechanism analysis and source code walk-through" /><published>2020-10-09T00:00:00-07:00</published><updated>2020-10-09T00:00:00-07:00</updated><id>https://blog.labxq.com/cloud/2020/10/09/kubernetes-exploration-part2-kubectl-apiserver</id><content type="html" xml:base="https://blog.labxq.com/cloud/2020/10/09/kubernetes-exploration-part2-kubectl-apiserver.html"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>kubectl and kube-apiserver together form the client-server communication model of Kubernetes. As the interface to k8s, they are a natural entry point for diving into the whole system.</p>

<h2 id="kubectl">kubectl</h2>
<p>kubectl, the command-line client of k8s, is usually the first tool people use to interact with k8s. In essence, it is just an HTTP client that calls different APIs on the k8s API server and gets the results back.</p>
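<p>To make the "just an HTTP client" point concrete, here is a minimal sketch that lists pod names with a plain GET request. It runs against a local fake server (<code class="language-plaintext highlighter-rouge">newFakeAPIServer</code>, a hypothetical helper with canned data) rather than a real cluster, and it leaves out everything a real client needs, such as authentication, TLS and content negotiation. The function <code class="language-plaintext highlighter-rouge">listPodNames</code> is likewise an illustration, not part of kubectl.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// podList captures only the fields this sketch reads from a PodList response.
type podList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
	} `json:"items"`
}

// listPodNames issues a plain GET against the pods endpoint of a namespace.
func listPodNames(baseURL, namespace string) ([]string, error) {
	resp, err := http.Get(baseURL + "/api/v1/namespaces/" + namespace + "/pods")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	var list podList
	if err := json.NewDecoder(resp.Body).Decode(&list); err != nil {
		return nil, err
	}
	names := make([]string, 0, len(list.Items))
	for _, item := range list.Items {
		names = append(names, item.Metadata.Name)
	}
	return names, nil
}

// newFakeAPIServer serves a canned PodList for the "default" namespace only.
func newFakeAPIServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1/namespaces/default/pods", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"kind":"PodList","items":[{"metadata":{"name":"web-0"}},{"metadata":{"name":"web-1"}}]}`)
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newFakeAPIServer()
	defer srv.Close()

	names, err := listPodNames(srv.URL, "default")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(names)
}
```

<p>Against a real cluster, running <code class="language-plaintext highlighter-rouge">kubectl get pods -v=8</code> prints the HTTP requests kubectl sends, which is a quick way to compare them with this sketch.</p>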
<h3 id="project-structure">Project Structure</h3>
<p>In the good old days, when the k8s project was relatively small, the kubectl source code lived directly inside the main kubernetes project.</p>

<p>To make the project more modular, kubectl was recently factored out of the main k8s project into a stand-alone project, <a href="https://github.com/kubernetes/kubectl">kubernetes/kubectl</a>. The <code class="language-plaintext highlighter-rouge">kubernetes/pkg/kubectl/</code> directory now contains only code that forwards to the kubectl library.</p>

<p>The kubectl project structure is as follows:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>staging/src/k8s.io/kubectl/pkg
├── apply
├── apps
├── cmd
├── describe
├── drain
├── explain
├── generate
├── generated
├── metricsutil
├── polymorphichelpers
├── proxy
├── rawhttp
├── scale
├── scheme
├── util
└── validation
</code></pre></div></div>
<p>We can see that the kubectl project consists mostly of the implementations of the various kubectl commands.</p>

<h3 id="process-of-kubectl-get">Process of “kubectl get”</h3>
<p>Here is what the most common kubectl operation <code class="language-plaintext highlighter-rouge">get</code> looks like:</p>

<p>First, in the main k8s project, <code class="language-plaintext highlighter-rouge">NewKubectlCommand</code> calls <code class="language-plaintext highlighter-rouge">NewCmdGet</code> from the kubectl library:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// source: https://github.com/kubernetes/kubernetes/blob/1b32dfdafdcd6cce21415c75385970a9ae5b0f01/pkg/kubectl/cmd/cmd.go#L432</span>
<span class="k">func</span> <span class="n">NewKubectlCommand</span><span class="p">(</span><span class="n">in</span> <span class="n">io</span><span class="o">.</span><span class="n">Reader</span><span class="p">,</span> <span class="n">out</span><span class="p">,</span> <span class="n">err</span> <span class="n">io</span><span class="o">.</span><span class="n">Writer</span><span class="p">)</span> <span class="o">*</span><span class="n">cobra</span><span class="o">.</span><span class="n">Command</span> <span class="p">{</span>
	<span class="o">...</span>
	<span class="n">groups</span> <span class="o">:=</span> <span class="n">templates</span><span class="o">.</span><span class="n">CommandGroups</span><span class="p">{</span>
		<span class="p">{</span>
			<span class="n">Message</span><span class="o">:</span> <span class="s">"Basic Commands (Intermediate):"</span><span class="p">,</span>
			<span class="n">Commands</span><span class="o">:</span> <span class="p">[]</span><span class="o">*</span><span class="n">cobra</span><span class="o">.</span><span class="n">Command</span><span class="p">{</span>
				<span class="n">explain</span><span class="o">.</span><span class="n">NewCmdExplain</span><span class="p">(</span><span class="s">"kubectl"</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="n">ioStreams</span><span class="p">),</span>
				<span class="n">get</span><span class="o">.</span><span class="n">NewCmdGet</span><span class="p">(</span><span class="s">"kubectl"</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="n">ioStreams</span><span class="p">),</span>
				<span class="n">edit</span><span class="o">.</span><span class="n">NewCmdEdit</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">ioStreams</span><span class="p">),</span>
				<span class="nb">delete</span><span class="o">.</span><span class="n">NewCmdDelete</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">ioStreams</span><span class="p">),</span>
			<span class="p">},</span>
		<span class="p">},</span>
	<span class="p">}</span>
	<span class="o">...</span>
	<span class="n">templates</span><span class="o">.</span><span class="n">ActsAsRootCommand</span><span class="p">(</span><span class="n">cmds</span><span class="p">,</span> <span class="n">filters</span><span class="p">,</span> <span class="n">groups</span><span class="o">...</span><span class="p">)</span>
	<span class="o">...</span>
	<span class="k">return</span> <span class="n">cmds</span>
<span class="p">}</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">get</code> is a sub-command of <code class="language-plaintext highlighter-rouge">kubectl</code>, so its implementation is similar to <code class="language-plaintext highlighter-rouge">NewKubectlCommand</code>:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">NewCmdGet</span><span class="p">(</span><span class="n">parent</span> <span class="kt">string</span><span class="p">,</span> <span class="n">f</span> <span class="n">cmdutil</span><span class="o">.</span><span class="n">Factory</span><span class="p">,</span> <span class="n">streams</span> <span class="n">genericclioptions</span><span class="o">.</span><span class="n">IOStreams</span><span class="p">)</span> <span class="o">*</span><span class="n">cobra</span><span class="o">.</span><span class="n">Command</span> <span class="p">{</span>
	<span class="n">o</span> <span class="o">:=</span> <span class="n">NewGetOptions</span><span class="p">(</span><span class="n">parent</span><span class="p">,</span> <span class="n">streams</span><span class="p">)</span>
	<span class="n">cmd</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">cobra</span><span class="o">.</span><span class="n">Command</span><span class="p">{</span>
		<span class="n">Use</span><span class="o">:</span>                   <span class="s">"get [(-o|--output=)json|yaml|wide|custom-columns=...|custom-columns-file=...|go-template=...|go-template-file=...|jsonpath=...|jsonpath-file=...] (TYPE[.VERSION][.GROUP] [NAME | -l label] | TYPE[.VERSION][.GROUP]/NAME ...) [flags]"</span><span class="p">,</span>
		<span class="n">DisableFlagsInUseLine</span><span class="o">:</span> <span class="no">true</span><span class="p">,</span>
		<span class="n">Short</span><span class="o">:</span>                 <span class="n">i18n</span><span class="o">.</span><span class="n">T</span><span class="p">(</span><span class="s">"Display one or many resources"</span><span class="p">),</span>
		<span class="n">Long</span><span class="o">:</span>                  <span class="n">getLong</span> <span class="o">+</span> <span class="s">"</span><span class="se">\n\n</span><span class="s">"</span> <span class="o">+</span> <span class="n">cmdutil</span><span class="o">.</span><span class="n">SuggestAPIResources</span><span class="p">(</span><span class="n">parent</span><span class="p">),</span>
		<span class="n">Example</span><span class="o">:</span>               <span class="n">getExample</span><span class="p">,</span>
		<span class="n">Run</span><span class="o">:</span> <span class="k">func</span><span class="p">(</span><span class="n">cmd</span> <span class="o">*</span><span class="n">cobra</span><span class="o">.</span><span class="n">Command</span><span class="p">,</span> <span class="n">args</span> <span class="p">[]</span><span class="kt">string</span><span class="p">)</span> <span class="p">{</span>
			<span class="n">cmdutil</span><span class="o">.</span><span class="n">CheckErr</span><span class="p">(</span><span class="n">o</span><span class="o">.</span><span class="n">Complete</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">cmd</span><span class="p">,</span> <span class="n">args</span><span class="p">))</span>
			<span class="n">cmdutil</span><span class="o">.</span><span class="n">CheckErr</span><span class="p">(</span><span class="n">o</span><span class="o">.</span><span class="n">Validate</span><span class="p">(</span><span class="n">cmd</span><span class="p">))</span>
			<span class="n">cmdutil</span><span class="o">.</span><span class="n">CheckErr</span><span class="p">(</span><span class="n">o</span><span class="o">.</span><span class="n">Run</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">cmd</span><span class="p">,</span> <span class="n">args</span><span class="p">))</span>
		<span class="p">},</span>
		<span class="n">SuggestFor</span><span class="o">:</span> <span class="p">[]</span><span class="kt">string</span><span class="p">{</span><span class="s">"list"</span><span class="p">,</span> <span class="s">"ps"</span><span class="p">},</span>
	<span class="p">}</span>
    <span class="o">...</span>
	<span class="k">return</span> <span class="n">cmd</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The actual get operation is performed in the <code class="language-plaintext highlighter-rouge">GetOptions.Run</code> function:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/b326948a9a317dbc17c6f32dfeea26e090bde3b0/staging/src/k8s.io/kubectl/pkg/cmd/get/get.go#L448</span>
<span class="k">func</span> <span class="p">(</span><span class="n">o</span> <span class="o">*</span><span class="n">GetOptions</span><span class="p">)</span> <span class="n">Run</span><span class="p">(</span><span class="n">f</span> <span class="n">cmdutil</span><span class="o">.</span><span class="n">Factory</span><span class="p">,</span> <span class="n">cmd</span> <span class="o">*</span><span class="n">cobra</span><span class="o">.</span><span class="n">Command</span><span class="p">,</span> <span class="n">args</span> <span class="p">[]</span><span class="kt">string</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="n">r</span> <span class="o">:=</span> <span class="n">f</span><span class="o">.</span><span class="n">NewBuilder</span><span class="p">()</span><span class="o">.</span>
		<span class="n">Unstructured</span><span class="p">()</span><span class="o">.</span>
		<span class="n">NamespaceParam</span><span class="p">(</span><span class="n">o</span><span class="o">.</span><span class="n">Namespace</span><span class="p">)</span><span class="o">.</span><span class="n">DefaultNamespace</span><span class="p">()</span><span class="o">.</span><span class="n">AllNamespaces</span><span class="p">(</span><span class="n">o</span><span class="o">.</span><span class="n">AllNamespaces</span><span class="p">)</span><span class="o">.</span>
		<span class="n">FilenameParam</span><span class="p">(</span><span class="n">o</span><span class="o">.</span><span class="n">ExplicitNamespace</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">o</span><span class="o">.</span><span class="n">FilenameOptions</span><span class="p">)</span><span class="o">.</span>
		<span class="n">LabelSelectorParam</span><span class="p">(</span><span class="n">o</span><span class="o">.</span><span class="n">LabelSelector</span><span class="p">)</span><span class="o">.</span>
		<span class="n">FieldSelectorParam</span><span class="p">(</span><span class="n">o</span><span class="o">.</span><span class="n">FieldSelector</span><span class="p">)</span><span class="o">.</span>
		<span class="n">RequestChunksOf</span><span class="p">(</span><span class="n">chunkSize</span><span class="p">)</span><span class="o">.</span>
		<span class="n">ResourceTypeOrNameArgs</span><span class="p">(</span><span class="no">true</span><span class="p">,</span> <span class="n">args</span><span class="o">...</span><span class="p">)</span><span class="o">.</span>
		<span class="n">ContinueOnError</span><span class="p">()</span><span class="o">.</span>
		<span class="n">Latest</span><span class="p">()</span><span class="o">.</span>
		<span class="n">Flatten</span><span class="p">()</span><span class="o">.</span>
		<span class="n">TransformRequests</span><span class="p">(</span><span class="n">o</span><span class="o">.</span><span class="n">transformRequests</span><span class="p">)</span><span class="o">.</span>
        <span class="n">Do</span><span class="p">()</span>
    <span class="o">...</span>
    <span class="n">infos</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">r</span><span class="o">.</span><span class="n">Infos</span><span class="p">()</span>
    <span class="o">...</span>
    <span class="n">printer</span><span class="o">.</span><span class="n">PrintObj</span><span class="p">(</span><span class="n">info</span><span class="o">.</span><span class="n">Object</span><span class="p">,</span> <span class="n">w</span><span class="p">)</span>
	<span class="o">...</span>
</code></pre></div></div>
<p>We can see that the <code class="language-plaintext highlighter-rouge">builder</code> pattern is used. Each command-line option corresponds to one step of the build pipeline.</p>
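<p>To make the pattern concrete, here is a minimal standalone sketch of a chained builder. It is not kubectl code: <code class="language-plaintext highlighter-rouge">Query</code>, <code class="language-plaintext highlighter-rouge">QueryBuilder</code> and <code class="language-plaintext highlighter-rouge">describe</code> are hypothetical names chosen for illustration, and the real <code class="language-plaintext highlighter-rouge">Builder</code> tracks far more state.</p>

```go
package main

import (
	"fmt"
	"strings"
)

// Query is the finished product assembled by QueryBuilder.
// All names in this sketch are hypothetical, not part of kubectl.
type Query struct {
	Namespace string
	Selector  string
	Resources []string
}

// QueryBuilder collects options one call at a time.
type QueryBuilder struct {
	q Query
}

func NewQueryBuilder() *QueryBuilder { return &QueryBuilder{} }

// Each setter returns the builder itself so that calls can be chained.
func (b *QueryBuilder) Namespace(ns string) *QueryBuilder {
	b.q.Namespace = ns
	return b
}

func (b *QueryBuilder) LabelSelector(s string) *QueryBuilder {
	b.q.Selector = s
	return b
}

func (b *QueryBuilder) Resources(names ...string) *QueryBuilder {
	b.q.Resources = append(b.q.Resources, names...)
	return b
}

// Do finishes the pipeline and returns the assembled query.
func (b *QueryBuilder) Do() Query { return b.q }

// describe renders a query as a single line.
func describe(q Query) string {
	return fmt.Sprintf("ns=%s selector=%s resources=%s",
		q.Namespace, q.Selector, strings.Join(q.Resources, ","))
}

func main() {
	q := NewQueryBuilder().
		Namespace("default").
		LabelSelector("app=web").
		Resources("pods", "services").
		Do()
	fmt.Println(describe(q))
}
```

<p>Because every setter returns the builder, a command-line flag can be mapped to exactly one call in the chain, which is the shape we see in <code class="language-plaintext highlighter-rouge">GetOptions.Run</code>.</p>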

<p>In the <code class="language-plaintext highlighter-rouge">Do</code> function, the <code class="language-plaintext highlighter-rouge">Visitor</code> pattern is used. A visitor is responsible for iterating over all resources fetched from the API server.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/c386fb09a7bde5924a07bd271f6dbb5f4e698aa8/staging/src/k8s.io/cli-runtime/pkg/resource/builder.go#L919</span>
<span class="k">func</span> <span class="p">(</span><span class="n">b</span> <span class="o">*</span><span class="n">Builder</span><span class="p">)</span> <span class="n">visitByResource</span><span class="p">()</span> <span class="o">*</span><span class="n">Result</span> <span class="p">{</span>
    <span class="o">...</span>
	<span class="c">// retrieve one client for each resource</span>
	<span class="n">mappings</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">b</span><span class="o">.</span><span class="n">resourceTupleMappings</span><span class="p">()</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="n">result</span><span class="o">.</span><span class="n">err</span> <span class="o">=</span> <span class="n">err</span>
		<span class="k">return</span> <span class="n">result</span>
	<span class="p">}</span>
	<span class="n">clients</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">(</span><span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="n">RESTClient</span><span class="p">)</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">mapping</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">mappings</span> <span class="p">{</span>
		<span class="n">s</span> <span class="o">:=</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Sprintf</span><span class="p">(</span><span class="s">"%s/%s"</span><span class="p">,</span> <span class="n">mapping</span><span class="o">.</span><span class="n">GroupVersionKind</span><span class="o">.</span><span class="n">GroupVersion</span><span class="p">()</span><span class="o">.</span><span class="n">String</span><span class="p">(),</span> <span class="n">mapping</span><span class="o">.</span><span class="n">Resource</span><span class="o">.</span><span class="n">Resource</span><span class="p">)</span>
		<span class="k">if</span> <span class="n">_</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="n">clients</span><span class="p">[</span><span class="n">s</span><span class="p">];</span> <span class="n">ok</span> <span class="p">{</span>
			<span class="k">continue</span>
		<span class="p">}</span>
		<span class="n">client</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">b</span><span class="o">.</span><span class="n">getClient</span><span class="p">(</span><span class="n">mapping</span><span class="o">.</span><span class="n">GroupVersionKind</span><span class="o">.</span><span class="n">GroupVersion</span><span class="p">())</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="n">result</span><span class="o">.</span><span class="n">err</span> <span class="o">=</span> <span class="n">err</span>
			<span class="k">return</span> <span class="n">result</span>
		<span class="p">}</span>
		<span class="n">clients</span><span class="p">[</span><span class="n">s</span><span class="p">]</span> <span class="o">=</span> <span class="n">client</span>
	<span class="p">}</span>
	<span class="n">items</span> <span class="o">:=</span> <span class="p">[]</span><span class="n">Visitor</span><span class="p">{}</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">tuple</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">b</span><span class="o">.</span><span class="n">resourceTuples</span> <span class="p">{</span>
		<span class="n">mapping</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="n">mappings</span><span class="p">[</span><span class="n">tuple</span><span class="o">.</span><span class="n">Resource</span><span class="p">]</span>
		<span class="k">if</span> <span class="o">!</span><span class="n">ok</span> <span class="p">{</span>
			<span class="k">return</span> <span class="n">result</span><span class="o">.</span><span class="n">withError</span><span class="p">(</span><span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"resource %q is not recognized: %v"</span><span class="p">,</span> <span class="n">tuple</span><span class="o">.</span><span class="n">Resource</span><span class="p">,</span> <span class="n">mappings</span><span class="p">))</span>
		<span class="p">}</span>
		<span class="n">s</span> <span class="o">:=</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Sprintf</span><span class="p">(</span><span class="s">"%s/%s"</span><span class="p">,</span> <span class="n">mapping</span><span class="o">.</span><span class="n">GroupVersionKind</span><span class="o">.</span><span class="n">GroupVersion</span><span class="p">()</span><span class="o">.</span><span class="n">String</span><span class="p">(),</span> <span class="n">mapping</span><span class="o">.</span><span class="n">Resource</span><span class="o">.</span><span class="n">Resource</span><span class="p">)</span>
		<span class="n">client</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="n">clients</span><span class="p">[</span><span class="n">s</span><span class="p">]</span>
		<span class="k">if</span> <span class="o">!</span><span class="n">ok</span> <span class="p">{</span>
			<span class="k">return</span> <span class="n">result</span><span class="o">.</span><span class="n">withError</span><span class="p">(</span><span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"could not find a client for resource %q"</span><span class="p">,</span> <span class="n">tuple</span><span class="o">.</span><span class="n">Resource</span><span class="p">))</span>
		<span class="p">}</span>
		<span class="n">selectorNamespace</span> <span class="o">:=</span> <span class="n">b</span><span class="o">.</span><span class="n">namespace</span>
		<span class="n">info</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">Info</span><span class="p">{</span>
			<span class="n">Client</span><span class="o">:</span>    <span class="n">client</span><span class="p">,</span>
			<span class="n">Mapping</span><span class="o">:</span>   <span class="n">mapping</span><span class="p">,</span>
			<span class="n">Namespace</span><span class="o">:</span> <span class="n">selectorNamespace</span><span class="p">,</span>
			<span class="n">Name</span><span class="o">:</span>      <span class="n">tuple</span><span class="o">.</span><span class="n">Name</span><span class="p">,</span>
		<span class="p">}</span>
		<span class="n">items</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">items</span><span class="p">,</span> <span class="n">info</span><span class="p">)</span>
	<span class="p">}</span>
	<span class="o">...</span>
	<span class="n">result</span><span class="o">.</span><span class="n">sources</span> <span class="o">=</span> <span class="n">items</span>
	<span class="k">return</span> <span class="n">result</span>
<span class="p">}</span>
</code></pre></div></div>
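<p>First, a <code class="language-plaintext highlighter-rouge">RESTClient</code> is retrieved for each resource type. The client is then stored in an <code class="language-plaintext highlighter-rouge">Info</code> visitor, which is later used to fetch the resource and save/return the result.</p>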
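<p>The visitor idea can be sketched in isolation as follows. This is a simplified illustration, not the cli-runtime implementation: <code class="language-plaintext highlighter-rouge">Item</code>, <code class="language-plaintext highlighter-rouge">VisitFunc</code> and <code class="language-plaintext highlighter-rouge">collectNames</code> are hypothetical names, and no API server is contacted.</p>

```go
package main

import (
	"errors"
	"fmt"
)

// Item is a hypothetical stand-in for a fetched resource.
type Item struct {
	Kind string
	Name string
}

// VisitFunc is called once per item.
type VisitFunc func(*Item) error

// Visitor is anything that can feed its items to a VisitFunc.
type Visitor interface {
	Visit(VisitFunc) error
}

// A single item is itself a Visitor.
func (i *Item) Visit(fn VisitFunc) error { return fn(i) }

// VisitorList visits every element in order and stops at the first error.
type VisitorList []Visitor

func (l VisitorList) Visit(fn VisitFunc) error {
	for _, v := range l {
		if err := v.Visit(fn); err != nil {
			return err
		}
	}
	return nil
}

// collectNames walks a Visitor and returns "kind/name" for each item.
func collectNames(v Visitor) ([]string, error) {
	var names []string
	err := v.Visit(func(i *Item) error {
		if i.Name == "" {
			return errors.New("item has no name")
		}
		names = append(names, i.Kind+"/"+i.Name)
		return nil
	})
	return names, err
}

func main() {
	items := VisitorList{
		&Item{Kind: "pod", Name: "web-0"},
		&Item{Kind: "pod", Name: "web-1"},
	}
	names, err := collectNames(items)
	fmt.Println(names, err)
}
```

<p>The caller only supplies a function. How the items are stored, ordered or combined stays hidden behind the <code class="language-plaintext highlighter-rouge">Visitor</code> interface, so a single item and a whole list can be treated the same way.</p>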

<p><code class="language-plaintext highlighter-rouge">RESTClient</code> is implemented in the <code class="language-plaintext highlighter-rouge">client-go</code> library, which is provided so that developers can write their own customized clients.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/b1098bd0d53658bfb945e485683d543ab7dc73ba/staging/src/k8s.io/client-go/rest/client.go#L107</span>
<span class="k">func</span> <span class="n">NewRESTClient</span><span class="p">(</span><span class="n">baseURL</span> <span class="o">*</span><span class="n">url</span><span class="o">.</span><span class="n">URL</span><span class="p">,</span> <span class="n">versionedAPIPath</span> <span class="kt">string</span><span class="p">,</span> <span class="n">config</span> <span class="n">ClientContentConfig</span><span class="p">,</span> <span class="n">rateLimiter</span> <span class="n">flowcontrol</span><span class="o">.</span><span class="n">RateLimiter</span><span class="p">,</span> <span class="n">client</span> <span class="o">*</span><span class="n">http</span><span class="o">.</span><span class="n">Client</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">RESTClient</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">config</span><span class="o">.</span><span class="n">ContentType</span><span class="p">)</span> <span class="o">==</span> <span class="m">0</span> <span class="p">{</span>
		<span class="n">config</span><span class="o">.</span><span class="n">ContentType</span> <span class="o">=</span> <span class="s">"application/json"</span>
	<span class="p">}</span>

	<span class="n">base</span> <span class="o">:=</span> <span class="o">*</span><span class="n">baseURL</span>
	<span class="k">if</span> <span class="o">!</span><span class="n">strings</span><span class="o">.</span><span class="n">HasSuffix</span><span class="p">(</span><span class="n">base</span><span class="o">.</span><span class="n">Path</span><span class="p">,</span> <span class="s">"/"</span><span class="p">)</span> <span class="p">{</span>
		<span class="n">base</span><span class="o">.</span><span class="n">Path</span> <span class="o">+=</span> <span class="s">"/"</span>
	<span class="p">}</span>
	<span class="n">base</span><span class="o">.</span><span class="n">RawQuery</span> <span class="o">=</span> <span class="s">""</span>
	<span class="n">base</span><span class="o">.</span><span class="n">Fragment</span> <span class="o">=</span> <span class="s">""</span>

	<span class="k">return</span> <span class="o">&amp;</span><span class="n">RESTClient</span><span class="p">{</span>
		<span class="n">base</span><span class="o">:</span>             <span class="o">&amp;</span><span class="n">base</span><span class="p">,</span>
		<span class="n">versionedAPIPath</span><span class="o">:</span> <span class="n">versionedAPIPath</span><span class="p">,</span>
		<span class="n">content</span><span class="o">:</span>          <span class="n">config</span><span class="p">,</span>
		<span class="n">createBackoffMgr</span><span class="o">:</span> <span class="n">readExpBackoffConfig</span><span class="p">,</span>
		<span class="n">rateLimiter</span><span class="o">:</span>      <span class="n">rateLimiter</span><span class="p">,</span>

		<span class="n">Client</span><span class="o">:</span> <span class="n">client</span><span class="p">,</span>
	<span class="p">},</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
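<p>In order to build a RESTClient, we need a URL (base + versionedAPIPath) and some configuration (content, rateLimiter, etc.).
The client itself is built on <code class="language-plaintext highlighter-rouge">http.Client</code> from Go’s <code class="language-plaintext highlighter-rouge">net/http</code> package.</p>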
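<p>As a rough sketch of how these pieces fit together, the following standalone program joins a base URL and a versioned API path into a request URL and prepares a request for an <code class="language-plaintext highlighter-rouge">http.Client</code>. The type <code class="language-plaintext highlighter-rouge">miniClient</code>, its methods and the host <code class="language-plaintext highlighter-rouge">example.invalid</code> are assumptions made for illustration; client-go assembles its request URLs with considerably more logic.</p>

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"path"
	"strings"
	"time"
)

// miniClient is a hypothetical, stripped-down stand-in for a REST client.
type miniClient struct {
	base             *url.URL
	versionedAPIPath string
	httpClient       *http.Client
}

func newMiniClient(rawBase, versionedAPIPath string) (*miniClient, error) {
	base, err := url.Parse(rawBase)
	if err != nil {
		return nil, err
	}
	// Normalize the base the same way the excerpt above does.
	if !strings.HasSuffix(base.Path, "/") {
		base.Path += "/"
	}
	base.RawQuery = ""
	base.Fragment = ""
	return &miniClient{
		base:             base,
		versionedAPIPath: versionedAPIPath,
		httpClient:       &http.Client{Timeout: 10 * time.Second},
	}, nil
}

// resourceURL joins base, versioned API path and resource segments.
func (c *miniClient) resourceURL(segments ...string) string {
	u := *c.base
	u.Path = path.Join(append([]string{u.Path, c.versionedAPIPath}, segments...)...)
	return u.String()
}

// newGetRequest builds (but does not send) a GET request for the resource.
func (c *miniClient) newGetRequest(segments ...string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, c.resourceURL(segments...), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/json")
	return req, nil
}

func main() {
	c, err := newMiniClient("https://example.invalid:6443?x=1", "/api/v1")
	if err != nil {
		panic(err)
	}
	req, err := c.newGetRequest("namespaces", "default", "pods")
	if err != nil {
		panic(err)
	}
	// Sending it would be: resp, err := c.httpClient.Do(req)
	fmt.Println(req.Method, req.URL)
}
```

<p>The request is only constructed here, not sent. Passing it to <code class="language-plaintext highlighter-rouge">httpClient.Do</code> would perform the actual round trip.</p>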

<h3 id="summary">Summary</h3>
<p>So that’s it. We type a command, which is converted to REST API requests. The result is fetched and stored in a resource struct, and finally printed to the terminal. This is the typical workflow of <code class="language-plaintext highlighter-rouge">kubectl</code>.</p>

<h2 id="kube-apiserver">kube-apiserver</h2>
<p>kube-apiserver is the central communication gateway, both between clients and the k8s cluster and among the components inside the cluster.</p>
<h3 id="overview">Overview</h3>
<p>Let’s have a quick review of what an HTTP server usually does:</p>
<ol>
  <li>open a socket and listen for incoming requests;</li>
  <li>route each incoming request to the proper handler;</li>
  <li>the handler processes the request;</li>
  <li>read the result from, or save it to, persistent storage.</li>
</ol>
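<p>The steps above can be sketched with Go’s standard library. This is only an illustration under simplifying assumptions: <code class="language-plaintext highlighter-rouge">memStore</code> is a hypothetical in-memory map standing in for persistent storage, the <code class="language-plaintext highlighter-rouge">/items/</code> route is invented, and requests are exercised in-process with <code class="language-plaintext highlighter-rouge">httptest</code> instead of a real socket.</p>

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// memStore is a hypothetical in-memory stand-in for persistent storage.
type memStore struct {
	mu   sync.Mutex
	data map[string]string
}

func newMemStore() *memStore { return &memStore{data: map[string]string{}} }

// newHandler routes requests under /items/ to a handler that reads from
// or writes to the store, depending on the HTTP method.
func newHandler(s *memStore) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/items/", func(w http.ResponseWriter, r *http.Request) {
		key := strings.TrimPrefix(r.URL.Path, "/items/")
		s.mu.Lock()
		defer s.mu.Unlock()
		switch r.Method {
		case http.MethodPut:
			body, err := io.ReadAll(r.Body)
			if err != nil {
				http.Error(w, err.Error(), http.StatusBadRequest)
				return
			}
			s.data[key] = string(body)
			w.WriteHeader(http.StatusCreated)
		case http.MethodGet:
			v, ok := s.data[key]
			if !ok {
				http.NotFound(w, r)
				return
			}
			fmt.Fprint(w, v)
		default:
			w.WriteHeader(http.StatusMethodNotAllowed)
		}
	})
	return mux
}

// call sends one in-process request to h and returns status and body.
func call(h http.Handler, method, target, body string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, target, strings.NewReader(body)))
	return rec.Code, rec.Body.String()
}

func main() {
	h := newHandler(newMemStore())
	fmt.Println(call(h, http.MethodPut, "/items/a", "hello"))
	fmt.Println(call(h, http.MethodGet, "/items/a", ""))
	// A real server would open a socket instead, for example:
	// http.ListenAndServe(":8080", h)
}
```

<p>Routing is done by the <code class="language-plaintext highlighter-rouge">ServeMux</code>, processing by the handler function, and storage by the map. Only the listening socket is left out, so that the sketch runs without binding a port.</p>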

<p>Kube-apiserver’s core logic is no different.</p>

<p>In addition to the above, kube-apiserver also implements authentication (authN), authorization (authZ) and admission control, which regulate incoming requests.</p>

<p>In general, kube-apiserver’s logic can be divided into two parts: setting up the server and running the server. Let’s take a look at each of them:</p>
<h3 id="server-setup">Server Setup</h3>
<p>Before the API server starts to serve requests, it has to be properly set up.</p>

<p>What setup does a server need? If we recall how well-known web servers such as Nginx behave, there are two kinds of setup: request handlers and various server configurations.</p>
<h4 id="register-handler">Register Handler</h4>
<p>Handlers are installed per API group. The following function installs all handlers of the given API groups:</p>
<div class="language-golang highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/4362d613f243a02558f03e90b8fcb58b4c6efb06/staging/src/k8s.io/apiserver/pkg/server/genericapiserver.go#L453</span>
<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">GenericAPIServer</span><span class="p">)</span> <span class="n">InstallAPIGroups</span><span class="p">(</span><span class="n">apiGroupInfos</span> <span class="o">...*</span><span class="n">APIGroupInfo</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="o">...</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">apiGroupInfo</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">apiGroupInfos</span> <span class="p">{</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">s</span><span class="o">.</span><span class="n">installAPIResources</span><span class="p">(</span><span class="n">APIGroupPrefix</span><span class="p">,</span> <span class="n">apiGroupInfo</span><span class="p">,</span> <span class="n">openAPIModels</span><span class="p">);</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="k">return</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"unable to install api resources: %v"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
		<span class="p">}</span>
	<span class="o">...</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
<p>After several levels of decomposition, each specific type of API resource is bound to its storage via a handler. This is done by the following function:</p>
<div class="language-golang highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/c522ee08a3d248ec1097e3673119ffa7a4e1ef7b/staging/src/k8s.io/apiserver/pkg/endpoints/installer.go#L97</span>
<span class="k">func</span> <span class="p">(</span><span class="n">a</span> <span class="o">*</span><span class="n">APIInstaller</span><span class="p">)</span> <span class="n">Install</span><span class="p">()</span> <span class="p">([]</span><span class="n">metav1</span><span class="o">.</span><span class="n">APIResource</span><span class="p">,</span> <span class="o">*</span><span class="n">restful</span><span class="o">.</span><span class="n">WebService</span><span class="p">,</span> <span class="p">[]</span><span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">var</span> <span class="n">apiResources</span> <span class="p">[]</span><span class="n">metav1</span><span class="o">.</span><span class="n">APIResource</span>
	<span class="k">var</span> <span class="n">errors</span> <span class="p">[]</span><span class="kt">error</span>
	<span class="n">ws</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">newWebService</span><span class="p">()</span>

	<span class="c">// Register the paths in a deterministic (sorted) order to get a deterministic swagger spec.</span>
	<span class="n">paths</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">([]</span><span class="kt">string</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">a</span><span class="o">.</span><span class="n">group</span><span class="o">.</span><span class="n">Storage</span><span class="p">))</span>
	<span class="k">var</span> <span class="n">i</span> <span class="kt">int</span> <span class="o">=</span> <span class="m">0</span>
	<span class="k">for</span> <span class="n">path</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">a</span><span class="o">.</span><span class="n">group</span><span class="o">.</span><span class="n">Storage</span> <span class="p">{</span>
		<span class="n">paths</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">path</span>
		<span class="n">i</span><span class="o">++</span>
	<span class="p">}</span>
	<span class="n">sort</span><span class="o">.</span><span class="n">Strings</span><span class="p">(</span><span class="n">paths</span><span class="p">)</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">path</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">paths</span> <span class="p">{</span>
		<span class="n">apiResource</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">a</span><span class="o">.</span><span class="n">registerResourceHandlers</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">a</span><span class="o">.</span><span class="n">group</span><span class="o">.</span><span class="n">Storage</span><span class="p">[</span><span class="n">path</span><span class="p">],</span> <span class="n">ws</span><span class="p">)</span>
		<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="n">errors</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">errors</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"error in registering resource: %s, %v"</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="n">err</span><span class="p">))</span>
		<span class="p">}</span>
		<span class="k">if</span> <span class="n">apiResource</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="n">apiResources</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">apiResources</span><span class="p">,</span> <span class="o">*</span><span class="n">apiResource</span><span class="p">)</span>
		<span class="p">}</span>
	<span class="p">}</span>
	<span class="k">return</span> <span class="n">apiResources</span><span class="p">,</span> <span class="n">ws</span><span class="p">,</span> <span class="n">errors</span>
<span class="p">}</span>
</code></pre></div></div>
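<p>The sorting step in <code class="language-plaintext highlighter-rouge">Install</code> exists because Go does not guarantee any iteration order for maps. A minimal standalone sketch of the same idiom is shown below; the <code class="language-plaintext highlighter-rouge">storage</code> map, its placeholder values and the helper <code class="language-plaintext highlighter-rouge">sortedKeys</code> are hypothetical.</p>

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns the keys of m in ascending order.
func sortedKeys(m map[string]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Hypothetical path-to-storage mapping; the values are placeholders.
	storage := map[string]string{
		"pods/status": "podStatusStorage",
		"pods":        "podStorage",
		"services":    "serviceStorage",
	}
	for _, path := range sortedKeys(storage) {
		fmt.Println("registering", path, "->", storage[path])
	}
}
```

<p>Collecting and sorting the keys first means handlers are registered in the same order on every run, regardless of how the map happens to be laid out.</p>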
<p>The handlers for the actual CRUD operations are set up inside the <code class="language-plaintext highlighter-rouge">registerResourceHandlers</code> function:</p>
<div class="language-golang highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/c522ee08a3d248ec1097e3673119ffa7a4e1ef7b/staging/src/k8s.io/apiserver/pkg/endpoints/installer.go#L185</span>
<span class="k">func</span> <span class="p">(</span><span class="n">a</span> <span class="o">*</span><span class="n">APIInstaller</span><span class="p">)</span> <span class="n">registerResourceHandlers</span><span class="p">(</span><span class="n">path</span> <span class="kt">string</span><span class="p">,</span> <span class="n">storage</span> <span class="n">rest</span><span class="o">.</span><span class="n">Storage</span><span class="p">,</span> <span class="n">ws</span> <span class="o">*</span><span class="n">restful</span><span class="o">.</span><span class="n">WebService</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">metav1</span><span class="o">.</span><span class="n">APIResource</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="o">...</span>
	<span class="n">getter</span><span class="p">,</span> <span class="n">isGetter</span> <span class="o">:=</span> <span class="n">storage</span><span class="o">.</span><span class="p">(</span><span class="n">rest</span><span class="o">.</span><span class="n">Getter</span><span class="p">)</span>
	<span class="o">...</span>
	<span class="n">actions</span> <span class="o">:=</span> <span class="p">[]</span><span class="n">action</span><span class="p">{}</span>
	<span class="n">actions</span> <span class="o">=</span> <span class="n">appendIf</span><span class="p">(</span><span class="n">actions</span><span class="p">,</span> <span class="n">action</span><span class="p">{</span><span class="s">"LIST"</span><span class="p">,</span> <span class="n">resourcePath</span><span class="p">,</span> <span class="n">resourceParams</span><span class="p">,</span> <span class="n">namer</span><span class="p">,</span> <span class="no">false</span><span class="p">},</span> <span class="n">isLister</span><span class="p">)</span>
	<span class="n">actions</span> <span class="o">=</span> <span class="n">appendIf</span><span class="p">(</span><span class="n">actions</span><span class="p">,</span> <span class="n">action</span><span class="p">{</span><span class="s">"POST"</span><span class="p">,</span> <span class="n">resourcePath</span><span class="p">,</span> <span class="n">resourceParams</span><span class="p">,</span> <span class="n">namer</span><span class="p">,</span> <span class="no">false</span><span class="p">},</span> <span class="n">isCreater</span><span class="p">)</span>
	<span class="n">actions</span> <span class="o">=</span> <span class="n">appendIf</span><span class="p">(</span><span class="n">actions</span><span class="p">,</span> <span class="n">action</span><span class="p">{</span><span class="s">"GET"</span><span class="p">,</span> <span class="n">itemPath</span><span class="p">,</span> <span class="n">nameParams</span><span class="p">,</span> <span class="n">namer</span><span class="p">,</span> <span class="no">false</span><span class="p">},</span> <span class="n">isGetter</span><span class="p">)</span>
	<span class="o">...</span>
	<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">action</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">actions</span> <span class="p">{</span>
		<span class="o">...</span>
		<span class="n">routes</span> <span class="o">:=</span> <span class="p">[]</span><span class="o">*</span><span class="n">restful</span><span class="o">.</span><span class="n">RouteBuilder</span><span class="p">{}</span>
		<span class="k">switch</span> <span class="n">action</span><span class="o">.</span><span class="n">Verb</span> <span class="p">{</span>
		<span class="k">case</span> <span class="s">"GET"</span><span class="o">:</span>
			<span class="k">if</span> <span class="n">isGetterWithOptions</span> <span class="p">{</span>
				<span class="n">handler</span> <span class="o">=</span> <span class="n">restfulGetResourceWithOptions</span><span class="p">(</span><span class="n">getterWithOptions</span><span class="p">,</span> <span class="n">reqScope</span><span class="p">,</span> <span class="n">isSubresource</span><span class="p">)</span>
			<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
				<span class="n">handler</span> <span class="o">=</span> <span class="n">restfulGetResource</span><span class="p">(</span><span class="n">getter</span><span class="p">,</span> <span class="n">exporter</span><span class="p">,</span> <span class="n">reqScope</span><span class="p">)</span>
			<span class="p">}</span>
			<span class="o">...</span>
			<span class="n">route</span> <span class="o">:=</span> <span class="n">ws</span><span class="o">.</span><span class="n">GET</span><span class="p">(</span><span class="n">action</span><span class="o">.</span><span class="n">Path</span><span class="p">)</span><span class="o">.</span><span class="n">To</span><span class="p">(</span><span class="n">handler</span><span class="p">)</span><span class="o">.</span>
				<span class="n">Doc</span><span class="p">(</span><span class="n">doc</span><span class="p">)</span><span class="o">.</span>
				<span class="n">Param</span><span class="p">(</span><span class="n">ws</span><span class="o">.</span><span class="n">QueryParameter</span><span class="p">(</span><span class="s">"pretty"</span><span class="p">,</span> <span class="s">"If 'true', then the output is pretty printed."</span><span class="p">))</span><span class="o">.</span>
				<span class="n">Operation</span><span class="p">(</span><span class="s">"read"</span><span class="o">+</span><span class="n">namespaced</span><span class="o">+</span><span class="n">kind</span><span class="o">+</span><span class="n">strings</span><span class="o">.</span><span class="n">Title</span><span class="p">(</span><span class="n">subresource</span><span class="p">)</span><span class="o">+</span><span class="n">operationSuffix</span><span class="p">)</span><span class="o">.</span>
				<span class="n">Produces</span><span class="p">(</span><span class="nb">append</span><span class="p">(</span><span class="n">storageMeta</span><span class="o">.</span><span class="n">ProducesMIMETypes</span><span class="p">(</span><span class="n">action</span><span class="o">.</span><span class="n">Verb</span><span class="p">),</span> <span class="n">mediaTypes</span><span class="o">...</span><span class="p">)</span><span class="o">...</span><span class="p">)</span><span class="o">.</span>
				<span class="n">Returns</span><span class="p">(</span><span class="n">http</span><span class="o">.</span><span class="n">StatusOK</span><span class="p">,</span> <span class="s">"OK"</span><span class="p">,</span> <span class="n">producedObject</span><span class="p">)</span><span class="o">.</span>
				<span class="n">Writes</span><span class="p">(</span><span class="n">producedObject</span><span class="p">)</span>
			<span class="o">...</span>
			<span class="n">routes</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">routes</span><span class="p">,</span> <span class="n">route</span><span class="p">)</span>
			<span class="o">...</span>
	<span class="p">}</span>
	<span class="o">...</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Each CRUD action is registered with its own handler. Together, the handlers form the <code class="language-plaintext highlighter-rouge">routes</code> that the server dispatches requests to.</p>
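<p>The registration pattern can be sketched with only the standard library. This is a minimal illustration, not the kube-apiserver code: the <code class="language-plaintext highlighter-rouge">action</code> and <code class="language-plaintext highlighter-rouge">route</code> types and the <code class="language-plaintext highlighter-rouge">buildRoutes</code> and <code class="language-plaintext highlighter-rouge">register</code> helpers are hypothetical names, and <code class="language-plaintext highlighter-rouge">net/http</code> stands in for go-restful.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// action is a hypothetical stand-in for the (verb, path) pairs the installer iterates over.
type action struct {
	Verb string
	Path string
}

// route pairs an action with the handler chosen for it.
type route struct {
	action
	Handler http.HandlerFunc
}

// buildRoutes picks one handler per verb, mirroring the switch statement in the installer.
func buildRoutes(actions []action) []route {
	routes := []route{}
	for _, a := range actions {
		var h http.HandlerFunc
		switch a.Verb {
		case "GET":
			h = func(w http.ResponseWriter, r *http.Request) { fmt.Fprint(w, "read") }
		case "DELETE":
			h = func(w http.ResponseWriter, r *http.Request) { fmt.Fprint(w, "deleted") }
		default:
			continue // verbs without a handler are skipped in this sketch
		}
		routes = append(routes, route{action: a, Handler: h})
	}
	return routes
}

// register installs the routes on a mux and rejects verbs that have no route for the path.
func register(mux *http.ServeMux, routes []route) {
	byPath := map[string]map[string]http.HandlerFunc{}
	for _, rt := range routes {
		if byPath[rt.Path] == nil {
			byPath[rt.Path] = map[string]http.HandlerFunc{}
		}
		byPath[rt.Path][rt.Verb] = rt.Handler
	}
	for path, verbs := range byPath {
		verbs := verbs // per-iteration copy captured by the closure below
		mux.HandleFunc(path, func(w http.ResponseWriter, r *http.Request) {
			h, ok := verbs[r.Method]
			if !ok {
				http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
				return
			}
			h(w, r)
		})
	}
}

func main() {
	mux := http.NewServeMux()
	register(mux, buildRoutes([]action{{"GET", "/pods"}, {"DELETE", "/pods"}}))

	// Send one request per verb and print the status code and body.
	for _, verb := range []string{"GET", "DELETE", "PUT"} {
		rec := httptest.NewRecorder()
		mux.ServeHTTP(rec, httptest.NewRequest(verb, "/pods", nil))
		fmt.Println(verb, rec.Code, rec.Body.String())
	}
}
</code></pre></div></div>
<p>The point of the sketch is the separation of the two steps: handlers are first selected per verb, and only then attached to the server as routes.</p>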

<h4 id="register-filter">Register Filter</h4>
<p>Each piece of server configuration is implemented as a <code class="language-plaintext highlighter-rouge">filter</code>. The filters are set up as follows:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/d74ab9e1a4929be208d4529fd12b76d3fcd5d546/staging/src/k8s.io/apiserver/pkg/server/config.go#L671</span>
<span class="k">func</span> <span class="n">DefaultBuildHandlerChain</span><span class="p">(</span><span class="n">apiHandler</span> <span class="n">http</span><span class="o">.</span><span class="n">Handler</span><span class="p">,</span> <span class="n">c</span> <span class="o">*</span><span class="n">Config</span><span class="p">)</span> <span class="n">http</span><span class="o">.</span><span class="n">Handler</span> <span class="p">{</span>
	<span class="n">handler</span> <span class="o">:=</span> <span class="n">genericapifilters</span><span class="o">.</span><span class="n">WithAuthorization</span><span class="p">(</span><span class="n">apiHandler</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">Authorization</span><span class="o">.</span><span class="n">Authorizer</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">Serializer</span><span class="p">)</span>
	<span class="n">handler</span> <span class="o">=</span> <span class="n">genericapifilters</span><span class="o">.</span><span class="n">WithImpersonation</span><span class="p">(</span><span class="n">handler</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">Authorization</span><span class="o">.</span><span class="n">Authorizer</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">Serializer</span><span class="p">)</span>
	<span class="n">handler</span> <span class="o">=</span> <span class="n">genericapifilters</span><span class="o">.</span><span class="n">WithAudit</span><span class="p">(</span><span class="n">handler</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">AuditBackend</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">AuditPolicyChecker</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">LongRunningFunc</span><span class="p">)</span>
	<span class="n">failedHandler</span> <span class="o">:=</span> <span class="n">genericapifilters</span><span class="o">.</span><span class="n">Unauthorized</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">Serializer</span><span class="p">)</span>
	<span class="n">failedHandler</span> <span class="o">=</span> <span class="n">genericapifilters</span><span class="o">.</span><span class="n">WithFailedAuthenticationAudit</span><span class="p">(</span><span class="n">failedHandler</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">AuditBackend</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">AuditPolicyChecker</span><span class="p">)</span>
	<span class="n">handler</span> <span class="o">=</span> <span class="n">genericapifilters</span><span class="o">.</span><span class="n">WithAuthentication</span><span class="p">(</span><span class="n">handler</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">Authentication</span><span class="o">.</span><span class="n">Authenticator</span><span class="p">,</span> <span class="n">failedHandler</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">Authentication</span><span class="o">.</span><span class="n">APIAudiences</span><span class="p">)</span>
	<span class="n">handler</span> <span class="o">=</span> <span class="n">genericfilters</span><span class="o">.</span><span class="n">WithCORS</span><span class="p">(</span><span class="n">handler</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">CorsAllowedOriginList</span><span class="p">,</span> <span class="no">nil</span><span class="p">,</span> <span class="no">nil</span><span class="p">,</span> <span class="no">nil</span><span class="p">,</span> <span class="s">"true"</span><span class="p">)</span>
	<span class="n">handler</span> <span class="o">=</span> <span class="n">genericfilters</span><span class="o">.</span><span class="n">WithTimeoutForNonLongRunningRequests</span><span class="p">(</span><span class="n">handler</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">LongRunningFunc</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">RequestTimeout</span><span class="p">)</span>
	<span class="n">handler</span> <span class="o">=</span> <span class="n">genericfilters</span><span class="o">.</span><span class="n">WithWaitGroup</span><span class="p">(</span><span class="n">handler</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">LongRunningFunc</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">HandlerChainWaitGroup</span><span class="p">)</span>
	<span class="n">handler</span> <span class="o">=</span> <span class="n">genericapifilters</span><span class="o">.</span><span class="n">WithRequestInfo</span><span class="p">(</span><span class="n">handler</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">RequestInfoResolver</span><span class="p">)</span>
	<span class="n">handler</span> <span class="o">=</span> <span class="n">genericapifilters</span><span class="o">.</span><span class="n">WithAuditAnnotations</span><span class="p">(</span><span class="n">handler</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">AuditBackend</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">AuditPolicyChecker</span><span class="p">)</span>
	<span class="n">handler</span> <span class="o">=</span> <span class="n">genericapifilters</span><span class="o">.</span><span class="n">WithWarningRecorder</span><span class="p">(</span><span class="n">handler</span><span class="p">)</span>
	<span class="n">handler</span> <span class="o">=</span> <span class="n">genericapifilters</span><span class="o">.</span><span class="n">WithCacheControl</span><span class="p">(</span><span class="n">handler</span><span class="p">)</span>
	<span class="n">handler</span> <span class="o">=</span> <span class="n">genericapifilters</span><span class="o">.</span><span class="n">WithRequestReceivedTimestamp</span><span class="p">(</span><span class="n">handler</span><span class="p">)</span>
	<span class="n">handler</span> <span class="o">=</span> <span class="n">genericfilters</span><span class="o">.</span><span class="n">WithPanicRecovery</span><span class="p">(</span><span class="n">handler</span><span class="p">)</span>
	<span class="k">return</span> <span class="n">handler</span>
<span class="p">}</span>
</code></pre></div></div>
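<p>In general, configuration is done by linking various <strong><em>filters</em></strong> together. Each filter wraps the handler built so far, so the filter applied last (<code class="language-plaintext highlighter-rouge">WithPanicRecovery</code>) is the first one to see an incoming request.</p>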
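<p>The wrapping order is easy to get backwards, so here is a minimal sketch of a filter chain using only <code class="language-plaintext highlighter-rouge">net/http</code>. The <code class="language-plaintext highlighter-rouge">filter</code> type and the <code class="language-plaintext highlighter-rouge">withTrace</code>, <code class="language-plaintext highlighter-rouge">withAuth</code>, <code class="language-plaintext highlighter-rouge">buildChain</code> and <code class="language-plaintext highlighter-rouge">serve</code> helpers are hypothetical names for illustration, not part of the Kubernetes code base.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// filter has the same shape as the With* functions: it wraps a handler.
type filter func(http.Handler) http.Handler

// withTrace is a hypothetical filter that records its name when a request passes through it.
func withTrace(name string, trace *[]string) filter {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			*trace = append(*trace, name)
			next.ServeHTTP(w, r)
		})
	}
}

// withAuth is a hypothetical filter that stops requests carrying no credentials.
func withAuth(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") == "" {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// buildChain wraps h with each filter in turn, so the last filter ends up outermost.
func buildChain(h http.Handler, filters ...filter) http.Handler {
	for _, f := range filters {
		h = f(h)
	}
	return h
}

// serve sends one request through the chain and returns the status code.
func serve(chain http.Handler, token string) int {
	req := httptest.NewRequest("GET", "/", nil)
	if token != "" {
		req.Header.Set("Authorization", token)
	}
	rec := httptest.NewRecorder()
	chain.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	trace := new([]string)
	api := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		*trace = append(*trace, "api")
	})
	chain := buildChain(api, withTrace("inner", trace), withAuth, withTrace("outer", trace))

	fmt.Println(serve(chain, "Bearer demo"), *trace) // authenticated request reaches the api handler
	*trace = nil
	fmt.Println(serve(chain, ""), *trace) // unauthenticated request stops at withAuth
}
</code></pre></div></div>
<p>The authenticated request passes through <code class="language-plaintext highlighter-rouge">outer</code>, then <code class="language-plaintext highlighter-rouge">inner</code>, then the API handler, while the unauthenticated one is answered with 401 before it reaches <code class="language-plaintext highlighter-rouge">inner</code>. Reading a chain like <code class="language-plaintext highlighter-rouge">DefaultBuildHandlerChain</code> from bottom to top therefore gives the order in which a request meets the filters.</p>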

<h3 id="serve">Serve</h3>
<p>In essence, serving means returning a proper response to a given request.
Let’s see what the serving process actually looks like in kube-apiserver:</p>
<div class="language-golang highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/13b6a929bc945f2bb97dbf7cd7f0fdd02b49bc0f/cmd/kube-apiserver/app/server.go#L161</span>
<span class="k">func</span> <span class="n">Run</span><span class="p">(</span><span class="n">completeOptions</span> <span class="n">completedServerRunOptions</span><span class="p">,</span> <span class="n">stopCh</span> <span class="o">&lt;-</span><span class="k">chan</span> <span class="k">struct</span><span class="p">{})</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="n">server</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">CreateServerChain</span><span class="p">(</span><span class="n">completeOptions</span><span class="p">,</span> <span class="n">stopCh</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">err</span>
	<span class="p">}</span>

	<span class="n">prepared</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">server</span><span class="o">.</span><span class="n">PrepareRun</span><span class="p">()</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">err</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="n">prepared</span><span class="o">.</span><span class="n">Run</span><span class="p">(</span><span class="n">stopCh</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">Run</code> function is invoked by the <code class="language-plaintext highlighter-rouge">cobra.Command</code> that forms the kube-apiserver binary. It provides the main serving logic.</p>

<p>The process is clear from the code: first do the necessary configuration (register handlers, filters, etc.), then do some preparation work, and finally run the server.</p>

<p>The prepared server is run as follows:</p>
<div class="language-golang highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/4362d613f243a02558f03e90b8fcb58b4c6efb06/staging/src/k8s.io/apiserver/pkg/server/genericapiserver.go#L316</span>
<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="n">preparedGenericAPIServer</span><span class="p">)</span> <span class="n">Run</span><span class="p">(</span><span class="n">stopCh</span> <span class="o">&lt;-</span><span class="k">chan</span> <span class="k">struct</span><span class="p">{})</span> <span class="kt">error</span> <span class="p">{</span>
	<span class="o">...</span>

	<span class="c">// close socket after delayed stopCh</span>
	<span class="n">stoppedCh</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">s</span><span class="o">.</span><span class="n">NonBlockingRun</span><span class="p">(</span><span class="n">delayedStopCh</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">err</span>
	<span class="p">}</span>

	<span class="o">&lt;-</span><span class="n">stopCh</span>

	<span class="c">// run shutdown hooks directly. This includes deregistering from the kubernetes endpoint in case of kube-apiserver.</span>
	<span class="n">err</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">RunPreShutdownHooks</span><span class="p">()</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="n">err</span>
	<span class="p">}</span>
	<span class="o">...</span>
	<span class="c">// Wait for all requests to finish, which are bounded by the RequestTimeout variable.</span>
	<span class="n">s</span><span class="o">.</span><span class="n">HandlerChainWaitGroup</span><span class="o">.</span><span class="n">Wait</span><span class="p">()</span>

	<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
<p>We can see that the server runs in a non-blocking fashion. Post-start and pre-shutdown hooks provide extra customizability, and the final wait on <code class="language-plaintext highlighter-rouge">HandlerChainWaitGroup</code> ensures a graceful shutdown of the server.</p>
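<p>The same run-then-drain shape can be sketched with the standard library. This is a simplified illustration under stated assumptions: <code class="language-plaintext highlighter-rouge">nonBlockingRun</code> and <code class="language-plaintext highlighter-rouge">run</code> are hypothetical helpers, <code class="language-plaintext highlighter-rouge">http.Server.Shutdown</code> stands in for the wait group, and the delayed stop channel of the real code is left out.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// nonBlockingRun starts serving in a goroutine and returns immediately.
// The returned channel is closed once the server has finished shutting down.
func nonBlockingRun(srv *http.Server, ln net.Listener, stopCh &lt;-chan struct{}) &lt;-chan struct{} {
	stoppedCh := make(chan struct{})
	go func() {
		defer close(stoppedCh)
		&lt;-stopCh
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		_ = srv.Shutdown(ctx) // stop accepting and wait for in-flight requests
	}()
	go func() {
		_ = srv.Serve(ln) // returns http.ErrServerClosed after Shutdown
	}()
	return stoppedCh
}

// run blocks until stopCh is closed, runs the pre-shutdown hook, then waits for the server to drain.
func run(srv *http.Server, ln net.Listener, preShutdown func() error, stopCh &lt;-chan struct{}) error {
	stoppedCh := nonBlockingRun(srv, ln, stopCh)
	&lt;-stopCh
	if err := preShutdown(); err != nil {
		return err
	}
	&lt;-stoppedCh
	return nil
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	srv := &amp;http.Server{Handler: http.NotFoundHandler()}

	// Simulate a stop signal arriving shortly after startup.
	stopCh := make(chan struct{})
	go func() {
		time.Sleep(100 * time.Millisecond)
		close(stopCh)
	}()

	hook := func() error {
		fmt.Println("pre-shutdown hook")
		return nil
	}
	err = run(srv, ln, hook, stopCh)
	fmt.Println("stopped, err =", err)
}
</code></pre></div></div>
<p>Because serving happens in its own goroutine, the caller stays free to wait on the stop channel and to run hooks, which is the reason for the non-blocking design.</p>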

<p>Finally, the core serving logic:</p>
<div class="language-golang highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/kubernetes/kubernetes/blob/2c3687c255c014f7049eed159de30a82082656b6/staging/src/k8s.io/apiserver/pkg/server/secure_serving.go#L147</span>
<span class="k">func</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span><span class="n">SecureServingInfo</span><span class="p">)</span> <span class="n">Serve</span><span class="p">(</span><span class="n">handler</span> <span class="n">http</span><span class="o">.</span><span class="n">Handler</span><span class="p">,</span> <span class="n">shutdownTimeout</span> <span class="n">time</span><span class="o">.</span><span class="n">Duration</span><span class="p">,</span> <span class="n">stopCh</span> <span class="o">&lt;-</span><span class="k">chan</span> <span class="k">struct</span><span class="p">{})</span> <span class="p">(</span><span class="o">&lt;-</span><span class="k">chan</span> <span class="k">struct</span><span class="p">{},</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">if</span> <span class="n">s</span><span class="o">.</span><span class="n">Listener</span> <span class="o">==</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"listener must not be nil"</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="n">tlsConfig</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">s</span><span class="o">.</span><span class="n">tlsConfig</span><span class="p">(</span><span class="n">stopCh</span><span class="p">)</span>
	<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
		<span class="k">return</span> <span class="no">nil</span><span class="p">,</span> <span class="n">err</span>
	<span class="p">}</span>

	<span class="n">secureServer</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">http</span><span class="o">.</span><span class="n">Server</span><span class="p">{</span>
		<span class="n">Addr</span><span class="o">:</span>           <span class="n">s</span><span class="o">.</span><span class="n">Listener</span><span class="o">.</span><span class="n">Addr</span><span class="p">()</span><span class="o">.</span><span class="n">String</span><span class="p">(),</span>
		<span class="n">Handler</span><span class="o">:</span>        <span class="n">handler</span><span class="p">,</span>
		<span class="n">MaxHeaderBytes</span><span class="o">:</span> <span class="m">1</span> <span class="o">&lt;&lt;</span> <span class="m">20</span><span class="p">,</span>
		<span class="n">TLSConfig</span><span class="o">:</span>      <span class="n">tlsConfig</span><span class="p">,</span>
	<span class="p">}</span>
	<span class="o">...</span>
	<span class="c">// use tlsHandshakeErrorWriter to handle messages of tls handshake error</span>
	<span class="n">tlsErrorWriter</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">tlsHandshakeErrorWriter</span><span class="p">{</span><span class="n">os</span><span class="o">.</span><span class="n">Stderr</span><span class="p">}</span>
	<span class="n">tlsErrorLogger</span> <span class="o">:=</span> <span class="n">log</span><span class="o">.</span><span class="n">New</span><span class="p">(</span><span class="n">tlsErrorWriter</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="m">0</span><span class="p">)</span>
	<span class="n">secureServer</span><span class="o">.</span><span class="n">ErrorLog</span> <span class="o">=</span> <span class="n">tlsErrorLogger</span>

	<span class="n">klog</span><span class="o">.</span><span class="n">Infof</span><span class="p">(</span><span class="s">"Serving securely on %s"</span><span class="p">,</span> <span class="n">secureServer</span><span class="o">.</span><span class="n">Addr</span><span class="p">)</span>
	<span class="k">return</span> <span class="n">RunServer</span><span class="p">(</span><span class="n">secureServer</span><span class="p">,</span> <span class="n">s</span><span class="o">.</span><span class="n">Listener</span><span class="p">,</span> <span class="n">shutdownTimeout</span><span class="p">,</span> <span class="n">stopCh</span><span class="p">)</span>
<span class="p">}</span>
<span class="c">//source: https://github.com/kubernetes/kubernetes/blob/2c3687c255c014f7049eed159de30a82082656b6/staging/src/k8s.io/apiserver/pkg/server/secure_serving.go#L207</span>
<span class="k">func</span> <span class="n">RunServer</span><span class="p">(</span>
	<span class="n">server</span> <span class="o">*</span><span class="n">http</span><span class="o">.</span><span class="n">Server</span><span class="p">,</span>
	<span class="n">ln</span> <span class="n">net</span><span class="o">.</span><span class="n">Listener</span><span class="p">,</span>
	<span class="n">shutDownTimeout</span> <span class="n">time</span><span class="o">.</span><span class="n">Duration</span><span class="p">,</span>
	<span class="n">stopCh</span> <span class="o">&lt;-</span><span class="k">chan</span> <span class="k">struct</span><span class="p">{},</span>
<span class="p">)</span> <span class="p">(</span><span class="o">&lt;-</span><span class="k">chan</span> <span class="k">struct</span><span class="p">{},</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
	<span class="o">...</span>
	<span class="k">go</span> <span class="k">func</span><span class="p">()</span> <span class="p">{</span>
		<span class="k">defer</span> <span class="n">utilruntime</span><span class="o">.</span><span class="n">HandleCrash</span><span class="p">()</span>

		<span class="k">var</span> <span class="n">listener</span> <span class="n">net</span><span class="o">.</span><span class="n">Listener</span>
		<span class="n">listener</span> <span class="o">=</span> <span class="n">tcpKeepAliveListener</span><span class="p">{</span><span class="n">ln</span><span class="p">}</span>
		<span class="k">if</span> <span class="n">server</span><span class="o">.</span><span class="n">TLSConfig</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
			<span class="n">listener</span> <span class="o">=</span> <span class="n">tls</span><span class="o">.</span><span class="n">NewListener</span><span class="p">(</span><span class="n">listener</span><span class="p">,</span> <span class="n">server</span><span class="o">.</span><span class="n">TLSConfig</span><span class="p">)</span>
		<span class="p">}</span>

		<span class="n">err</span> <span class="o">:=</span> <span class="n">server</span><span class="o">.</span><span class="n">Serve</span><span class="p">(</span><span class="n">listener</span><span class="p">)</span>

		<span class="n">msg</span> <span class="o">:=</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Sprintf</span><span class="p">(</span><span class="s">"Stopped listening on %s"</span><span class="p">,</span> <span class="n">ln</span><span class="o">.</span><span class="n">Addr</span><span class="p">()</span><span class="o">.</span><span class="n">String</span><span class="p">())</span>
		<span class="k">select</span> <span class="p">{</span>
		<span class="k">case</span> <span class="o">&lt;-</span><span class="n">stopCh</span><span class="o">:</span>
			<span class="n">klog</span><span class="o">.</span><span class="n">Info</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span>
		<span class="k">default</span><span class="o">:</span>
			<span class="nb">panic</span><span class="p">(</span><span class="n">fmt</span><span class="o">.</span><span class="n">Sprintf</span><span class="p">(</span><span class="s">"%s due to error: %v"</span><span class="p">,</span> <span class="n">msg</span><span class="p">,</span> <span class="n">err</span><span class="p">))</span>
		<span class="p">}</span>
	<span class="p">}()</span>
	<span class="k">return</span> <span class="n">stoppedCh</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Apart from auxiliary code such as the TLS config and graceful shutdown, the core serving logic is straightforward. Go’s standard <code class="language-plaintext highlighter-rouge">http.Server</code> handles incoming requests: whenever there is a new connection, a new goroutine is created to serve it. No thread pool, no task queue, just pure concurrency.</p>
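<p>A small experiment makes the goroutine-per-connection model visible. The sketch below is an illustration only, with a hypothetical helper named <code class="language-plaintext highlighter-rouge">serveConcurrently</code>: it sends many simultaneous requests to a handler that sleeps, and measures the total time.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// serveConcurrently fires n simultaneous requests at a handler that takes
// delay to answer, and returns the total wall-clock time.
func serveConcurrently(n int, delay time.Duration) time.Duration {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(delay) // simulate slow work inside one request
		fmt.Fprint(w, "ok")
	}))
	defer srv.Close()

	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i &lt; n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := http.Get(srv.URL)
			if err != nil {
				return
			}
			resp.Body.Close()
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	elapsed := serveConcurrently(20, 100*time.Millisecond)
	fmt.Println("20 requests took", elapsed.Round(10*time.Millisecond))
}
</code></pre></div></div>
<p>If the requests were served one after another, 20 of them would take about two seconds. Since each connection gets its own goroutine, the total should stay close to the delay of a single request.</p>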

<p>Why so simple? Because the <code class="language-plaintext highlighter-rouge">goroutine</code> hides most concurrency details for us. Unlike an OS thread, which is a fairly low-level primitive, a goroutine is a user-space “green thread” with many powerful features, such as channels for communication between goroutines and a scheduler that multiplexes goroutines onto OS threads.</p>

<p>Thus, users can write simple, clean concurrent code without getting trapped in multi-threading messes. That may be why Go is so popular in distributed systems, and why cornerstone projects like Docker and Kubernetes are written in Go.</p>]]></content><author><name></name></author><category term="cloud" /><category term="kubernetes" /><category term="golang" /><category term="client" /><category term="server" /><summary type="html"><![CDATA[Analysis of kubectl and kube-apiserver projects.]]></summary></entry><entry><title type="html">Kubernetes Project Exploration, Part 1 - A brief overview of Kubernetes project stack</title><link href="https://blog.labxq.com/cloud/2020/09/24/kubernetes-exploration-part1-overview-of-k8s-stack.html" rel="alternate" type="text/html" title="Kubernetes Project Exploration, Part 1 - A brief overview of Kubernetes project stack" /><published>2020-09-24T00:00:00-07:00</published><updated>2020-09-24T00:00:00-07:00</updated><id>https://blog.labxq.com/cloud/2020/09/24/kubernetes-exploration-part1-overview-of-k8s-stack</id><content type="html" xml:base="https://blog.labxq.com/cloud/2020/09/24/kubernetes-exploration-part1-overview-of-k8s-stack.html"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Kubernetes, an open-source container orchestration platform praised as the <strong>future of cloud infrastructure</strong>, is gaining more and more attention. It extends resource management from a single machine to a cluster of machines, and thus can be seen as the <strong>operating system of the cloud</strong>.</p>

<p>This article series briefly introduces the K8s project stack and the mechanisms of its important components.</p>

<h2 id="kubernetes-project-stack">Kubernetes project stack</h2>
<p>Before diving deep into the core logic of Kubernetes, let’s have a quick overview of what the K8s project stack consists of.</p>

<p>K8s is one of the best open-source projects in terms of project organization and support. It sets an outstanding example of how a fully open-source project should be grouped, organized and hosted on the Internet.</p>

<p>The K8s project is hosted <a href="https://github.com/kubernetes">here</a>.</p>

<h3 id="kubernetes">kubernetes</h3>
<p>This is the core project of the K8s stack. It is discussed later in this article.</p>

<h3 id="community">community</h3>
<p>Every engineer who has worked on an open-source project has felt the pain of reading obscure code alone, without any help.</p>

<p>Developers usually live in different parts of the world, and it is hard for them to sit together and discuss things like a team or a group of friends. This is the most challenging part of open-source projects.</p>

<p>For an open-source project to succeed, not only the code but also the people need to be organized. That’s why K8s provides tremendous support for its community.</p>

<h4 id="special-interest-groups-sigs">Special Interest Groups (SIGs)</h4>
<p>A SIG is a persistent, open group that focuses on one part of the project. Example SIGs are <code class="language-plaintext highlighter-rouge">sig-node, sig-scheduling</code>, etc.</p>

<p>Each SIG corresponds to a folder in the <a href="https://github.com/kubernetes/community">community project</a>, so you can easily find a SIG’s docs, events, and other material there.</p>

<h4 id="working-groupswg">Working Groups(WG)</h4>
<p>WGs mainly focus on addressing issues across SIGs.</p>

<h4 id="communication">Communication</h4>
<h5 id="instant-messaging">Instant Messaging</h5>
<p>Slack is popular among k8s communities nowadays. Just pick a SIG you are interested in and join that channel.</p>
<h5 id="conference">Conference</h5>
<p>KubeCon, powered by CNCF, is held every year as a venue where people share thoughts and innovations about Kubernetes.</p>

<h3 id="enhancements">enhancements</h3>
<p>A brand new feature is usually not stable, and not ready for GA, when development starts. Its design and maturity therefore need to be tracked somewhere outside the code itself.</p>

<p>Kubernetes’ <code class="language-plaintext highlighter-rouge">enhancements</code> repo serves this purpose. It holds the Kubernetes Enhancement Proposals (KEPs) and tracks each feature as it matures toward GA. Unlike OpenCV’s opencv_contrib repo, it stores proposals and tracking issues rather than the feature code, which lives in the main repo.</p>

<h3 id="test-infra">test-infra</h3>
<p>Everybody who has submitted a PR to a K8s project knows how powerful (even verbose :wink:) the code submission process is. This project contains all the tools and scripts that make the CI/CD pipeline of K8s projects as complete and reliable as possible. Each PR triggers several checks and tests, and you can interact with k8s-bot directly on GitHub.</p>

<h3 id="minikube">minikube</h3>
<p>Would you like to try K8s but have no machines at hand? Minikube is your best friend.
It is well suited for newcomers who want to try out K8s, and it can also serve as a playground to test configuration, deployment, etc.</p>

<h3 id="dashboard">dashboard</h3>
<p>Like any mature project, K8s provides a dashboard to visualize and operate its internal state. It is a convenient out-of-the-box solution, but not very flexible or customizable.</p>

<p>On many K8s-based platforms it is replaced with third-party choices. I personally use Grafana for metrics visualization and the VS Code Kubernetes extension for cluster operation and inspection.</p>

<h3 id="website">website</h3>
<p>The K8s homepage <a href="https://kubernetes.io/">https://kubernetes.io/</a> is open source as well. Try submitting a PR when you catch a typo :sunglasses:.</p>

<h2 id="kubernetes-core-project">Kubernetes core project</h2>
<p>The <a href="https://github.com/kubernetes/kubernetes">kubernetes/kubernetes</a> project contains the core functionality of K8s.</p>

<p>Its file structure is as follows:</p>
<h3 id="build">build</h3>
<p>This folder contains all the build scripts. The main script for building component binaries is <code class="language-plaintext highlighter-rouge">build/run.sh</code>. The build process runs in a Docker container to provide a consistent build environment.</p>
<h3 id="cmd">cmd</h3>
<p>This folder contains the source code of all the executables (<code class="language-plaintext highlighter-rouge">kubectl</code>, <code class="language-plaintext highlighter-rouge">kubelet</code>, etc.).
Each executable uses <code class="language-plaintext highlighter-rouge">cobra.Command</code> from the cobra library to create its command-line binary and to set up options and configs.</p>
<h3 id="pkg">pkg</h3>
<p>This folder is where most of the K8s core logic lives. Each component corresponds to a sub-folder. Start here if you would like to dive deep into one specific K8s component.</p>
<h3 id="staging">staging</h3>
<p>As the Kubernetes project grows larger and larger, it is no longer suitable for all components to live in a single project. However, scattering source code across many places would make it painful to look up.</p>

<p>The K8s project resolves this problem by introducing the <code class="language-plaintext highlighter-rouge">staging</code> folder. The source code of these components is still checked into the <code class="language-plaintext highlighter-rouge">kubernetes/kubernetes</code> project under the staging folder, but it is later published to external <code class="language-plaintext highlighter-rouge">k8s.io</code> repositories (e.g. <code class="language-plaintext highlighter-rouge">k8s.io/kube-scheduler</code>), and other projects reference it through those <code class="language-plaintext highlighter-rouge">k8s.io</code> repositories.</p>]]></content><author><name></name></author><category term="cloud" /><category term="kubernetes" /><category term="software-engineering" /><summary type="html"><![CDATA[Overview of Kubernetes stack and core project structure.]]></summary></entry><entry><title type="html">NVIDIA Docker deep dive</title><link href="https://blog.labxq.com/cloud/2020/09/13/nvidia-docker-deep-dive.html" rel="alternate" type="text/html" title="NVIDIA Docker deep dive" /><published>2020-09-13T00:00:00-07:00</published><updated>2020-09-13T00:00:00-07:00</updated><id>https://blog.labxq.com/cloud/2020/09/13/nvidia-docker-deep-dive</id><content type="html" xml:base="https://blog.labxq.com/cloud/2020/09/13/nvidia-docker-deep-dive.html"><![CDATA[<h2 id="what-is-nvidia-docker">What is NVIDIA Docker</h2>
<p>NVIDIA Docker is a project that provides GPU support for Docker containers. It is the cornerstone of NVIDIA NGC and of all container-based AI platforms.</p>

<h2 id="why-nvidia-docker-is-needed">Why NVIDIA Docker is needed</h2>
<p>NVIDIA Docker drastically simplifies deployment of GPU based application. Everybody using Linux knows how painful it is to handle NVIDIA driver stack on Linux(remember the famous dirty words Linus Torvalds threw to NVIDIA? ;-P).</p>

<p>With NVIDIA Docker, we can “pass through” the GPU from the host to the container, which eliminates the work of manually configuring the GPU inside the container.</p>

<h2 id="installation">Installation</h2>
<p>In general, we need to install Docker Engine, the NVIDIA driver and the NVIDIA Docker library on the host.</p>

<p>See this link for detailed instructions: <a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html">Installation guide</a></p>

<h2 id="usage">Usage</h2>
<p>Add the <code class="language-plaintext highlighter-rouge">--gpus</code> option to the DockerCLI command when starting a container. The started container will have access to the host’s GPUs.
Example:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ docker run <span class="nt">--rm</span> <span class="nt">--gpus</span> all ubuntu nvidia-smi
<span class="c">#output:</span>
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|<span class="o">===============================</span>+<span class="o">======================</span>+<span class="o">======================</span>|
|   0  GeForce GTX 1060    On   | 00000000:01:00.0 Off |                  N/A |
| N/A   52C    P8     6W /  N/A |     11MiB /  6078MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|<span class="o">=============================================================================</span>|
+-----------------------------------------------------------------------------+

</code></pre></div></div>

<h2 id="overview-of-docker-stack">Overview of Docker Stack</h2>
<p>Before diving into the mechanism of NVIDIA Docker, let’s quickly review what is under the hood of Docker:</p>
<h3 id="architecture">Architecture</h3>
<p><img src="/assets/images/Cloud-docker_architecture.png" alt="docker-architecture" /></p>

<p>In general, there are two layers: the upper layer interacts with the user, and the lower layer handles the core container logic.</p>
<h3 id="dockerclidockerd">DockerCLI/Dockerd</h3>
<p>These two components follow the client-server pattern and together form the user interface of Docker.</p>

<p>DockerCLI is the client through which the user interacts with Docker. Dockerd is a daemon that listens for client requests and forwards them to containerd.</p>

<h3 id="containerd">containerd</h3>
<p>Containerd is the engine behind Docker containers. It implements all of the container-related logic, and nothing else, including container lifecycle management, image management, etc.</p>

<h3 id="runc">runc</h3>
<p>runc is a CLI tool for spawning and running containers according to the OCI specification. Containerd uses this tool to actually start a new container.</p>
<h4 id="ociopen-container-initiative-runtime-spec">OCI(Open Container Initiative) runtime-spec</h4>
<p>To run a container, the user must provide runc with a runtime-spec (a config.json file) that describes the process to be run.</p>
<h5 id="config-file-example">Config file example</h5>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
    </span><span class="nl">"ociVersion"</span><span class="p">:</span><span class="w"> </span><span class="s2">"1.0.1"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"process"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"terminal"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
        </span><span class="nl">"user"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
            </span><span class="nl">"uid"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
            </span><span class="nl">"gid"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="w">
        </span><span class="p">},</span><span class="w">
        </span><span class="nl">"args"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
            </span><span class="s2">"sh"</span><span class="w">
        </span><span class="p">],</span><span class="w">
        </span><span class="nl">"env"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
            </span><span class="s2">"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"</span><span class="p">,</span><span class="w">
            </span><span class="s2">"TERM=xterm"</span><span class="w">
        </span><span class="p">],</span><span class="w">
        </span><span class="nl">"cwd"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"capabilities"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
            </span><span class="nl">"bounding"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
                </span><span class="s2">"CAP_AUDIT_WRITE"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"CAP_KILL"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"CAP_NET_BIND_SERVICE"</span><span class="w">
            </span><span class="p">]</span><span class="w">
        </span><span class="p">},</span><span class="w">
        </span><span class="nl">"rlimits"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
            </span><span class="p">{</span><span class="w">
                </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"RLIMIT_CORE"</span><span class="p">,</span><span class="w">
                </span><span class="nl">"hard"</span><span class="p">:</span><span class="w"> </span><span class="mi">1024</span><span class="p">,</span><span class="w">
                </span><span class="nl">"soft"</span><span class="p">:</span><span class="w"> </span><span class="mi">1024</span><span class="w">
            </span><span class="p">}</span><span class="w">
        </span><span class="p">]</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="nl">"root"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"path"</span><span class="p">:</span><span class="w"> </span><span class="s2">"rootfs"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"readonly"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="nl">"mounts"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
            </span><span class="nl">"destination"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/proc"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"proc"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"source"</span><span class="p">:</span><span class="w"> </span><span class="s2">"proc"</span><span class="w">
        </span><span class="p">}</span><span class="w">
    </span><span class="p">],</span><span class="w">
    </span><span class="nl">"hooks"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"prestart"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
            </span><span class="p">{</span><span class="w">
                </span><span class="nl">"path"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/usr/bin/fix-mounts"</span><span class="p">,</span><span class="w">
                </span><span class="nl">"args"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
                    </span><span class="s2">"fix-mounts"</span><span class="p">,</span><span class="w">
                    </span><span class="s2">"arg1"</span><span class="p">,</span><span class="w">
                    </span><span class="s2">"arg2"</span><span class="w">
                </span><span class="p">],</span><span class="w">
                </span><span class="nl">"env"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
                    </span><span class="s2">"key1=value1"</span><span class="w">
                </span><span class="p">]</span><span class="w">
            </span><span class="p">},</span><span class="w">
            </span><span class="p">{</span><span class="w">
                </span><span class="nl">"path"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/usr/bin/setup-network"</span><span class="w">
            </span><span class="p">}</span><span class="w">
        </span><span class="p">]</span><span class="w">
    </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>From the example, we can see that:</p>
<ol>
  <li>A container is nothing but a process, so a <code class="language-plaintext highlighter-rouge">process</code> field is required.  <br />
1.1 cwd: Working directory of the process;<br />
1.2 args: Command and arguments to be executed;<br />
1.3 env: Environment variables;</li>
  <li>We need a <code class="language-plaintext highlighter-rouge">root</code> field that specifies the root filesystem the container process runs in. The <code class="language-plaintext highlighter-rouge">mounts</code> field specifies extra filesystems to be mounted.</li>
  <li><strong>Hooks</strong>: Hooks are actions that can be executed at different stages of the container life-cycle. They can be used to customize the container’s behavior. <strong>NVIDIA Docker uses hooks to inject GPU functionality into the container.</strong></li>
</ol>
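<p>To make the three points above concrete, here is a minimal Go sketch that decodes a trimmed-down config.json and prints the process, root filesystem and prestart hooks. The <code class="language-plaintext highlighter-rouge">Spec</code> and <code class="language-plaintext highlighter-rouge">Hook</code> types, the <code class="language-plaintext highlighter-rouge">parseSpec</code> helper and the <code class="language-plaintext highlighter-rouge">sample</code> document are made up for this illustration and cover only a tiny subset of the real runtime-spec; this is not how runc itself is implemented.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hook mirrors one entry of the "hooks" section of config.json.
type Hook struct {
	Path string   `json:"path"`
	Args []string `json:"args,omitempty"`
	Env  []string `json:"env,omitempty"`
}

// Spec is a deliberately tiny, hypothetical subset of the runtime-spec;
// the real schema has many more fields.
type Spec struct {
	Process struct {
		Args []string `json:"args"`
		Env  []string `json:"env"`
		Cwd  string   `json:"cwd"`
	} `json:"process"`
	Root struct {
		Path     string `json:"path"`
		Readonly bool   `json:"readonly"`
	} `json:"root"`
	Hooks struct {
		Prestart []Hook `json:"prestart"`
	} `json:"hooks"`
}

// parseSpec decodes the JSON text of a config.json into a Spec.
func parseSpec(data string) (*Spec, error) {
	spec := new(Spec)
	if err := json.Unmarshal([]byte(data), spec); err != nil {
		return nil, err
	}
	return spec, nil
}

// sample is a shortened version of the config file example above.
const sample = `{
  "process": {"args": ["sh"], "env": ["TERM=xterm"], "cwd": "/"},
  "root": {"path": "rootfs", "readonly": true},
  "hooks": {"prestart": [{"path": "/usr/bin/fix-mounts", "args": ["fix-mounts", "arg1"]}]}
}`

func main() {
	spec, err := parseSpec(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("command:", spec.Process.Args, "cwd:", spec.Process.Cwd)
	fmt.Println("rootfs:", spec.Root.Path)
	for _, h := range spec.Hooks.Prestart {
		fmt.Println("prestart hook:", h.Path, h.Args)
	}
}
```

<p>Note that the decoder only succeeds on strictly valid JSON, so a config file with a trailing comma would be rejected with an error.</p>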

<h2 id="mechanism-of-nvidia-docker">Mechanism of NVIDIA Docker</h2>
<h3 id="architecture-1">Architecture</h3>
<p><img src="/assets/images/Cloud-nvidia_docker_arch.png" alt="nvidia-docker-architecture" /></p>

<p>From the previous discussion, we can see that the key point is to provide a hook that injects the GPU into the container.</p>

<p>In general, <code class="language-plaintext highlighter-rouge">containerd</code> inside Docker Engine needs to set up the hook according to the GPU options passed from DockerCLI. The GPU hooking logic itself is provided by a library called <strong>libnvidia-container</strong>.</p>

<h3 id="libnvidia-container">libnvidia-container</h3>
<p>This project provides a command-line program, <code class="language-plaintext highlighter-rouge">nvidia-container-cli</code>.</p>

<p>The CLI has several commands. The <code class="language-plaintext highlighter-rouge">list</code> command lists the components required to configure a container with GPU support. The <code class="language-plaintext highlighter-rouge">configure</code> command does the actual GPU setup inside the container.</p>

<p>The following is an example output of <code class="language-plaintext highlighter-rouge">list</code>:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ nvidia-container-cli list
<span class="c">#output:</span>
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/usr/bin/nvidia-smi
/usr/bin/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/bin/nvidia-cuda-mps-control
/usr/bin/nvidia-cuda-mps-server
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.450.51.06
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.450.51.06
/usr/lib/x86_64-linux-gnu/libcuda.so.450.51.06
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.450.51.06
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.51.06
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.450.51.06
/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.450.51.06
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.450.51.06
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.450.51.06
/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.450.51.06
/usr/lib/x86_64-linux-gnu/libnvcuvid.so.450.51.06
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.450.51.06
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.450.51.06
/usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.450.51.06
/usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.450.51.06
/usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.450.51.06
/usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.450.51.06
</code></pre></div></div>
<p>We can see that the list contains different kinds of GPU-related components: device nodes under /dev, utilities such as nvidia-smi, and the CUDA and driver libraries.</p>

<p>In general, what <code class="language-plaintext highlighter-rouge">nvidia-container-cli configure</code> does is <strong>mount</strong> the above components into the container to enable GPU support.</p>
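<p>As a rough illustration of what ends up being mounted, the following Go sketch groups a few paths from the <code class="language-plaintext highlighter-rouge">list</code> output by kind. The <code class="language-plaintext highlighter-rouge">classify</code> helper and its three categories are invented for this example and are not part of libnvidia-container; the rule it uses (path prefix and a <code class="language-plaintext highlighter-rouge">.so</code> substring) is only a heuristic.</p>

```go
package main

import (
	"fmt"
	"strings"
)

// classify guesses the kind of a component from its path alone.
// The category names are this sketch's own labels, not libnvidia-container terms.
func classify(path string) string {
	switch {
	case strings.HasPrefix(path, "/dev/"):
		return "device"
	case strings.Contains(path, ".so"):
		return "library"
	default:
		return "binary"
	}
}

func main() {
	// A few entries copied from the example output of the list command.
	paths := []string{
		"/dev/nvidiactl",
		"/dev/nvidia0",
		"/usr/bin/nvidia-smi",
		"/usr/lib/x86_64-linux-gnu/libcuda.so.450.51.06",
	}
	groups := map[string][]string{}
	for _, p := range paths {
		kind := classify(p)
		groups[kind] = append(groups[kind], p)
	}
	// Print in a fixed order, since map iteration order is random in Go.
	for _, kind := range []string{"device", "binary", "library"} {
		fmt.Println(kind+":", groups[kind])
	}
}
```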

<h2 id="code-analysis">Code Analysis</h2>
<p>Whoa, coding time! Let’s read the actual source code that implements nvidia-docker.</p>
<h3 id="process-of-enabling-gpu-support-of-a-container">Process of enabling GPU support of a container</h3>
<h4 id="containerd-1">containerd</h4>
<p>In containerd’s <code class="language-plaintext highlighter-rouge">ctr</code> command-line client, the <code class="language-plaintext highlighter-rouge">--gpus</code> option is parsed when a new container is created:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/containerd/containerd/blob/bc4c3813997554d14449b34d336bca2513e84f96/cmd/ctr/commands/run/run_unix.go#L74</span>
<span class="k">func</span> <span class="n">NewContainer</span><span class="p">(</span><span class="n">ctx</span> <span class="n">gocontext</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">client</span> <span class="o">*</span><span class="n">containerd</span><span class="o">.</span><span class="n">Client</span><span class="p">,</span> <span class="n">context</span> <span class="o">*</span><span class="n">cli</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span> <span class="p">(</span><span class="n">containerd</span><span class="o">.</span><span class="n">Container</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="o">...</span>
    <span class="k">var</span> <span class="p">(</span>
		<span class="n">opts</span>  <span class="p">[]</span><span class="n">oci</span><span class="o">.</span><span class="n">SpecOpts</span>
		<span class="n">cOpts</span> <span class="p">[]</span><span class="n">containerd</span><span class="o">.</span><span class="n">NewContainerOpts</span>
		<span class="n">spec</span>  <span class="n">containerd</span><span class="o">.</span><span class="n">NewContainerOpts</span>
    <span class="p">)</span>
    <span class="o">...</span>
    <span class="k">if</span> <span class="n">context</span><span class="o">.</span><span class="n">IsSet</span><span class="p">(</span><span class="s">"gpus"</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">opts</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="n">nvidia</span><span class="o">.</span><span class="n">WithGPUs</span><span class="p">(</span><span class="n">nvidia</span><span class="o">.</span><span class="n">WithDevices</span><span class="p">(</span><span class="n">context</span><span class="o">.</span><span class="n">Int</span><span class="p">(</span><span class="s">"gpus"</span><span class="p">)),</span> <span class="n">nvidia</span><span class="o">.</span><span class="n">WithAllCapabilities</span><span class="p">))</span>
    <span class="p">}</span>
    <span class="o">...</span>
    <span class="n">spec</span> <span class="o">=</span> <span class="n">containerd</span><span class="o">.</span><span class="n">WithSpec</span><span class="p">(</span><span class="o">&amp;</span><span class="n">s</span><span class="p">,</span> <span class="n">opts</span><span class="o">...</span><span class="p">)</span>
    <span class="o">...</span>
    <span class="k">return</span> <span class="n">client</span><span class="o">.</span><span class="n">NewContainer</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">id</span><span class="p">,</span> <span class="n">cOpts</span><span class="o">...</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">--gpus</code> option corresponds to the <code class="language-plaintext highlighter-rouge">WithGPUs</code> function, which returns a function that appends the NVIDIA prestart hook to the spec.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">//source: https://github.com/containerd/containerd/blob/bc4c3813997554d14449b34d336bca2513e84f96/contrib/nvidia/nvidia.go#L68</span>
<span class="k">const</span> <span class="n">NvidiaCLI</span> <span class="o">=</span> <span class="s">"nvidia-container-cli"</span>
<span class="o">...</span>
<span class="k">func</span> <span class="n">WithGPUs</span><span class="p">(</span><span class="n">opts</span> <span class="o">...</span><span class="n">Opts</span><span class="p">)</span> <span class="n">oci</span><span class="o">.</span><span class="n">SpecOpts</span> <span class="p">{</span>
	<span class="k">return</span> <span class="k">func</span><span class="p">(</span><span class="n">_</span> <span class="n">context</span><span class="o">.</span><span class="n">Context</span><span class="p">,</span> <span class="n">_</span> <span class="n">oci</span><span class="o">.</span><span class="n">Client</span><span class="p">,</span> <span class="n">_</span> <span class="o">*</span><span class="n">containers</span><span class="o">.</span><span class="n">Container</span><span class="p">,</span> <span class="n">s</span> <span class="o">*</span><span class="n">specs</span><span class="o">.</span><span class="n">Spec</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
		<span class="o">...</span>
		<span class="n">nvidiaPath</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">exec</span><span class="o">.</span><span class="n">LookPath</span><span class="p">(</span><span class="n">NvidiaCLI</span><span class="p">)</span>
		<span class="o">...</span>
		<span class="n">s</span><span class="o">.</span><span class="n">Hooks</span><span class="o">.</span><span class="n">Prestart</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">s</span><span class="o">.</span><span class="n">Hooks</span><span class="o">.</span><span class="n">Prestart</span><span class="p">,</span> <span class="n">specs</span><span class="o">.</span><span class="n">Hook</span><span class="p">{</span>
			<span class="n">Path</span><span class="o">:</span> <span class="n">c</span><span class="o">.</span><span class="n">OCIHookPath</span><span class="p">,</span>
			<span class="n">Args</span><span class="o">:</span> <span class="nb">append</span><span class="p">([]</span><span class="kt">string</span><span class="p">{</span>
				<span class="s">"containerd"</span><span class="p">,</span>
				<span class="s">"oci-hook"</span><span class="p">,</span>
				<span class="s">"--"</span><span class="p">,</span>
				<span class="n">nvidiaPath</span><span class="p">,</span>
				<span class="c">// ensures the required kernel modules are properly loaded</span>
				<span class="s">"--load-kmods"</span><span class="p">,</span>
			<span class="p">},</span> <span class="n">c</span><span class="o">.</span><span class="n">args</span><span class="p">()</span><span class="o">...</span><span class="p">),</span>
			<span class="n">Env</span><span class="o">:</span> <span class="n">os</span><span class="o">.</span><span class="n">Environ</span><span class="p">(),</span>
		<span class="p">})</span>
		<span class="k">return</span> <span class="no">nil</span>
	<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
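<p>The pattern used by <code class="language-plaintext highlighter-rouge">WithGPUs</code> above, a function that returns another function which later mutates the spec, is often called “functional options”. Here is a stripped-down sketch of the idea. The <code class="language-plaintext highlighter-rouge">Spec</code>, <code class="language-plaintext highlighter-rouge">Hook</code> and <code class="language-plaintext highlighter-rouge">SpecOpt</code> types and the <code class="language-plaintext highlighter-rouge">withPrestartHook</code> and <code class="language-plaintext highlighter-rouge">applyOpts</code> helpers are hypothetical stand-ins, not containerd’s real API, and the hook paths are placeholders.</p>

```go
package main

import "fmt"

// Hook and Spec are hypothetical stand-ins for specs.Hook and specs.Spec.
type Hook struct {
	Path string
	Args []string
}

type Spec struct {
	Prestart []Hook
}

// SpecOpt plays the role of oci.SpecOpts, minus the context, client and
// container parameters that the real type also receives.
type SpecOpt func(s *Spec) error

// withPrestartHook returns an option that appends one prestart hook,
// which is the same shape as WithGPUs.
func withPrestartHook(path string, args ...string) SpecOpt {
	return func(s *Spec) error {
		if path == "" {
			return fmt.Errorf("hook path must not be empty")
		}
		s.Prestart = append(s.Prestart, Hook{Path: path, Args: args})
		return nil
	}
}

// applyOpts runs every option against the spec, stopping at the first error.
func applyOpts(s *Spec, opts ...SpecOpt) error {
	for _, opt := range opts {
		if err := opt(s); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	spec := new(Spec)
	err := applyOpts(spec,
		withPrestartHook("/usr/bin/fix-mounts", "fix-mounts"),
		// Placeholder path; the real code looks the binary up with exec.LookPath.
		withPrestartHook("/usr/bin/nvidia-container-cli", "configure"),
	)
	if err != nil {
		panic(err)
	}
	for _, h := range spec.Prestart {
		fmt.Println(h.Path, h.Args)
	}
}
```

<p>The benefit of this design is that each command-line flag can contribute its own small option, and the caller simply collects them in a slice and applies them in order when the spec is built.</p>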
<h4 id="libnvidia-container-1">libnvidia-container</h4>
<p>The configuration options are passed to <code class="language-plaintext highlighter-rouge">nvidia-container-cli</code> as command-line options. The option table looks like this:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//source: https://github.com/NVIDIA/libnvidia-container/blob/e6e1c4860d9694608217737c31fc844ef8b9dfd7/src/cli/configure.c#L18</span>
<span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="nc">argp_option</span><span class="p">[]){</span>
    <span class="p">{</span><span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"Options:"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"pid"</span><span class="p">,</span> <span class="sc">'p'</span><span class="p">,</span> <span class="s">"PID"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"Container PID"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"device"</span><span class="p">,</span> <span class="sc">'d'</span><span class="p">,</span> <span class="s">"ID"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"Device UUID(s) or index(es) to isolate"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"require"</span><span class="p">,</span> <span class="sc">'r'</span><span class="p">,</span> <span class="s">"EXPR"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"Check container requirements"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"ldconfig"</span><span class="p">,</span> <span class="sc">'l'</span><span class="p">,</span> <span class="s">"PATH"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"Path to the ldconfig binary"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"compute"</span><span class="p">,</span> <span class="sc">'c'</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"Enable compute capability"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"utility"</span><span class="p">,</span> <span class="sc">'u'</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"Enable utility capability"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"video"</span><span class="p">,</span> <span class="sc">'v'</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"Enable video capability"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"graphics"</span><span class="p">,</span> <span class="sc">'g'</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"Enable graphics capability"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"display"</span><span class="p">,</span> <span class="sc">'D'</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"Enable display capability"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"ngx"</span><span class="p">,</span> <span class="sc">'n'</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"Enable ngx capability"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"compat32"</span><span class="p">,</span> <span class="mh">0x80</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"Enable 32bits compatibility"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"mig-config"</span><span class="p">,</span> <span class="mh">0x81</span><span class="p">,</span> <span class="s">"ID"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"Enable configuration of MIG devices"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"mig-monitor"</span><span class="p">,</span> <span class="mh">0x82</span><span class="p">,</span> <span class="s">"ID"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"Enable monitoring of MIG devices"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"no-cgroups"</span><span class="p">,</span> <span class="mh">0x83</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"Don't use cgroup enforcement"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"no-devbind"</span><span class="p">,</span> <span class="mh">0x84</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"Don't bind mount devices"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="mi">0</span><span class="p">},</span>
<span class="p">}</span>
</code></pre></div></div>
<p>We can choose which GPU capabilities the container needs by specifying the corresponding arguments.</p>

<p>The options are then applied in the <code class="language-plaintext highlighter-rouge">configure_command</code> function:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//source: https://github.com/NVIDIA/libnvidia-container/blob/e6e1c4860d9694608217737c31fc844ef8b9dfd7/src/cli/configure.c#L187</span>
<span class="kt">int</span> <span class="nf">configure_command</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="nc">context</span> <span class="o">*</span><span class="n">ctx</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">...</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">nvc_driver_mount</span><span class="p">(</span><span class="n">nvc</span><span class="p">,</span> <span class="n">cnt</span><span class="p">,</span> <span class="n">drv</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">warnx</span><span class="p">(</span><span class="s">"mount error: %s"</span><span class="p">,</span> <span class="n">nvc_error</span><span class="p">(</span><span class="n">nvc</span><span class="p">));</span>
        <span class="k">goto</span> <span class="n">fail</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The container and driver information is then passed to <code class="language-plaintext highlighter-rouge">nvc_driver_mount</code>, which performs the actual mount operations. Here is the <code class="language-plaintext highlighter-rouge">nvc_driver_mount</code> function:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//source: https://github.com/NVIDIA/libnvidia-container/blob/b2fd9616cd544f780b8d63357e747e7e96281743/src/nvc_mount.c#L709</span>
<span class="kt">int</span>
<span class="nf">nvc_driver_mount</span><span class="p">(</span><span class="k">struct</span> <span class="nc">nvc_context</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="k">const</span> <span class="k">struct</span> <span class="nc">nvc_container</span> <span class="o">*</span><span class="n">cnt</span><span class="p">,</span> <span class="k">const</span> <span class="k">struct</span> <span class="nc">nvc_driver_info</span> <span class="o">*</span><span class="n">info</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">...</span>
    <span class="cm">/* Procfs mount */</span>
    <span class="p">...</span>

    <span class="cm">/* Application profile mount */</span>
    <span class="p">...</span>
    <span class="cm">/* Host binary and library mounts */</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">info</span><span class="o">-&gt;</span><span class="n">bins</span> <span class="o">!=</span> <span class="nb">NULL</span> <span class="o">&amp;&amp;</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">nbins</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">((</span><span class="n">tmp</span> <span class="o">=</span> <span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">**</span><span class="p">)</span><span class="n">mount_files</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">err</span><span class="p">,</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">cfg</span><span class="p">.</span><span class="n">root</span><span class="p">,</span> <span class="n">cnt</span><span class="p">,</span> <span class="n">cnt</span><span class="o">-&gt;</span><span class="n">cfg</span><span class="p">.</span><span class="n">bins_dir</span><span class="p">,</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">bins</span><span class="p">,</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">nbins</span><span class="p">))</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
                <span class="k">goto</span> <span class="n">fail</span><span class="p">;</span>
        <span class="n">ptr</span> <span class="o">=</span> <span class="n">array_append</span><span class="p">(</span><span class="n">ptr</span><span class="p">,</span> <span class="n">tmp</span><span class="p">,</span> <span class="n">array_size</span><span class="p">(</span><span class="n">tmp</span><span class="p">));</span>
        <span class="n">free</span><span class="p">(</span><span class="n">tmp</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="p">...</span>

    <span class="cm">/* IPC mounts */</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">nipcs</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">((</span><span class="o">*</span><span class="n">ptr</span><span class="o">++</span> <span class="o">=</span> <span class="n">mount_ipc</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">err</span><span class="p">,</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">cfg</span><span class="p">.</span><span class="n">root</span><span class="p">,</span> <span class="n">cnt</span><span class="p">,</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">ipcs</span><span class="p">[</span><span class="n">i</span><span class="p">]))</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
                <span class="k">goto</span> <span class="n">fail</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="cm">/* Device mounts */</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">ndevs</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">cnt</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">OPT_NO_DEVBIND</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">if</span> <span class="p">((</span><span class="o">*</span><span class="n">ptr</span><span class="o">++</span> <span class="o">=</span> <span class="n">mount_device</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">err</span><span class="p">,</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">cfg</span><span class="p">.</span><span class="n">root</span><span class="p">,</span> <span class="n">cnt</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">info</span><span class="o">-&gt;</span><span class="n">devs</span><span class="p">[</span><span class="n">i</span><span class="p">]))</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
                    <span class="k">goto</span> <span class="n">fail</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">mount_*</code> functions are thin wrappers around the Linux <code class="language-plaintext highlighter-rouge">mount</code> system call.</p>

<p>See? Nothing fancy: we just <strong>mount the relevant binaries and devices one by one into the container’s filesystem</strong>. This is the <strong>core logic</strong> of NVIDIA Docker.</p>
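<p>To make that concrete, here is a minimal sketch of a read-only bind mount built directly on <code class="language-plaintext highlighter-rouge">mount(2)</code>. The helper name <code class="language-plaintext highlighter-rouge">bind_mount_readonly</code> and the exact flag set are assumptions made for illustration. The real <code class="language-plaintext highlighter-rouge">mount_*</code> helpers in libnvidia-container do more work around the system call, such as resolving paths inside the container’s root filesystem and reporting errors through their own error object.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;sys/mount.h&gt;  // mount(2) and the MS_* flags (Linux only)
#include &lt;cassert&gt;
#include &lt;cerrno&gt;

// Hypothetical helper, not part of libnvidia-container.
// Bind-mounts src onto dst, then remounts the bind mount read-only.
// Returns 0 on success, or the errno value of the failing mount(2) call.
// The caller needs CAP_SYS_ADMIN and dst must already exist.
int bind_mount_readonly(const char* src, const char* dst)
{
    assert(src != nullptr &amp;&amp; dst != nullptr);

    // Step 1: make the host path visible at dst.
    if (mount(src, dst, nullptr, MS_BIND, nullptr) != 0)
        return errno;

    // Step 2: flags such as MS_RDONLY are ignored when a bind mount is
    // created, so they are applied with a second, remounting call.
    const unsigned long flags = MS_BIND | MS_REMOUNT | MS_RDONLY | MS_NODEV | MS_NOSUID;
    if (mount(nullptr, dst, nullptr, flags, nullptr) != 0)
        return errno;

    return 0;
}
</code></pre></div></div>
<p>A call such as <code class="language-plaintext highlighter-rouge">bind_mount_readonly("/usr/bin/nvidia-smi", "/path/to/rootfs/usr/bin/nvidia-smi")</code> (both paths are placeholders) would expose one host binary inside a container root. Without the required privileges, or with a missing path, the function simply returns a non-zero errno value.</p>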

<h2 id="summary">Summary</h2>
<p>Here are the key takeaways of this article:</p>
<ol>
  <li>NVIDIA Docker gives containers full GPU support through a single <code class="language-plaintext highlighter-rouge">--gpus</code> option.</li>
  <li>NVIDIA Docker serves as a hook of containerd, providing customized functionality (e.g. GPU support) to Docker containers.</li>
  <li>NVIDIA Docker’s core library, libnvidia-container, works by mounting the host OS’s NVIDIA GPU driver components inside the Docker container’s filesystem.</li>
</ol>]]></content><author><name></name></author><category term="cloud" /><category term="gpu" /><category term="docker" /><category term="linux" /><category term="C/C++" /><category term="golang" /><summary type="html"><![CDATA[Tutorial, mechanism and code analysis of NVIDIA Docker.]]></summary></entry><entry><title type="html">Use OpenCV’s CUDA DNN module and YOLOv4 model to accelerate real-time object detection with GPU</title><link href="https://blog.labxq.com/ai/2020/08/30/opencv-dnn-yolo-inference.html" rel="alternate" type="text/html" title="Use OpenCV’s CUDA DNN module and YOLOv4 model to accelerate real-time object detection with GPU" /><published>2020-08-30T00:00:00-07:00</published><updated>2020-08-30T00:00:00-07:00</updated><id>https://blog.labxq.com/ai/2020/08/30/opencv-dnn-yolo-inference</id><content type="html" xml:base="https://blog.labxq.com/ai/2020/08/30/opencv-dnn-yolo-inference.html"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Computer vision is one of the main applications of the current deep-learning-based AI wave. Compared to many AI research fields that are still in the lab, CV is already widely deployed in production and used in all kinds of scenarios.</p>

<p>Originally developed by Intel and now 20 years old, OpenCV is a great tool for computer vision tasks.</p>

<p>In addition to traditional image processing use cases such as image smoothing and edge detection, OpenCV can also run deep neural network inference. Thus, we can apply state-of-the-art computer vision models to our projects with the help of OpenCV.</p>

<h2 id="why-choose-opencv">Why choose OpenCV?</h2>
<p>I have a hobby project, written in C++, that does real-time object detection. Once a video frame is obtained, I have to choose a framework to run object detection on it.</p>

<p>After some investigation, I found that popular AI frameworks such as TensorFlow and PyTorch are strongest at training models. For an AI application, however, only the inference part matters.</p>

<p>Since the project already uses OpenCV for other frame-handling work, using it for inference too means there is no need to pull in another AI framework, and external dependencies stay minimal.</p>

<p>Our friend for this task is the DNN module, which provides support for deep learning inference. In the past, the module had no CUDA backend, which limited its usage on NVIDIA GPUs, especially for real-time cases.</p>

<p>Fortunately, a sub-module named cuda4dnn was added recently, providing CUDA support for DNN. This sub-module is backed by NVIDIA’s cuDNN library, so inference performance on NVIDIA GPUs should be good.</p>

<h2 id="code-walk-through">Code walk-through</h2>
<p>The overall inference process is as follows:</p>
<h3 id="1-load-module">1. Load the model</h3>
<p>DNN supports multiple model formats, including .pb (TensorFlow) and .onnx. YOLO’s Darknet format is also supported. To start, download a pre-trained .weights file and its matching .cfg file <a href="https://github.com/AlexeyAB/darknet">here</a>.</p>

<p>Load the pre-trained weights with <code class="language-plaintext highlighter-rouge">readNet</code>:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cv</span><span class="o">::</span><span class="n">dnn</span><span class="o">::</span><span class="n">Net</span> <span class="n">net_</span> <span class="o">=</span> <span class="n">cv</span><span class="o">::</span><span class="n">dnn</span><span class="o">::</span><span class="n">readNet</span><span class="p">(</span><span class="n">weights_location</span><span class="p">,</span> <span class="n">cfg_location</span><span class="p">);</span>
</code></pre></div></div>
<h3 id="2-set-backend-and-target">2. Set backend and target</h3>
<p>DNN is a generalized inference engine that supports different backends and targets, including OpenCL, CUDA, FPGA, etc. Here we select CUDA as both the backend and the target so that inference runs on the GPU.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">net_</span><span class="p">.</span><span class="n">setPreferableBackend</span><span class="p">(</span><span class="n">cv</span><span class="o">::</span><span class="n">dnn</span><span class="o">::</span><span class="n">Backend</span><span class="o">::</span><span class="n">DNN_BACKEND_CUDA</span><span class="p">);</span>
<span class="n">net_</span><span class="p">.</span><span class="n">setPreferableTarget</span><span class="p">(</span><span class="n">cv</span><span class="o">::</span><span class="n">dnn</span><span class="o">::</span><span class="n">Target</span><span class="o">::</span><span class="n">DNN_TARGET_CUDA</span><span class="p">);</span>
</code></pre></div></div>
<h3 id="3-do-inference">3. Do inference</h3>
<p>Convert the image into a blob, set it as the input of the DNN Net, run the forward pass, and collect the results in the <code class="language-plaintext highlighter-rouge">outs</code> vector:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cv</span><span class="o">::</span><span class="n">Mat</span> <span class="n">blob</span> <span class="o">=</span> <span class="n">cv</span><span class="o">::</span><span class="n">dnn</span><span class="o">::</span><span class="n">blobFromImage</span><span class="p">(</span><span class="n">img</span><span class="p">,</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="mi">255</span><span class="p">,</span> <span class="n">cv</span><span class="o">::</span><span class="n">Size</span><span class="p">(</span><span class="mi">320</span><span class="p">,</span> <span class="mi">320</span><span class="p">),</span> <span class="n">cv</span><span class="o">::</span><span class="n">Scalar</span><span class="p">(),</span> <span class="nb">true</span><span class="p">,</span> <span class="nb">false</span><span class="p">,</span> <span class="n">CV_32F</span><span class="p">);</span>
<span class="n">net_</span><span class="p">.</span><span class="n">setInput</span><span class="p">(</span><span class="n">blob</span><span class="p">);</span>
<span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">cv</span><span class="o">::</span><span class="n">Mat</span><span class="o">&gt;</span> <span class="n">outs</span><span class="p">;</span>
<span class="n">net_</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">outs</span><span class="p">,</span> <span class="n">net_</span><span class="p">.</span><span class="n">getUnconnectedOutLayersNames</span><span class="p">());</span>
</code></pre></div></div>
<h4 id="terminology">Terminology</h4>
<h5 id="mat">Mat</h5>
<p>This is the OpenCV world’s counterpart of a tensor or NumPy array: it represents an n-dimensional array.</p>

<p>Data is wrapped inside the Mat object and no manual memory management is needed, just like <code class="language-plaintext highlighter-rouge">std::vector</code>. Data manipulation is thus much simpler than with a raw array.</p>
<h5 id="blob">Blob</h5>
<p>A 4-dimensional Mat in NCHW order (batch, channels, height, width), produced from the input image. The blob serves as the input of the DNN Net.</p>
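<p>The layout change is the part that is easiest to get wrong, so here is a minimal sketch of what turning one interleaved 8-bit, 3-channel image into an NCHW float blob involves. It leaves out the resizing and mean subtraction that <code class="language-plaintext highlighter-rouge">blobFromImage</code> can also perform, and the function name <code class="language-plaintext highlighter-rouge">hwc_to_nchw</code> is made up for this example. It is not how OpenCV implements the conversion.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;
#include &lt;vector&gt;

// Hypothetical helper: converts one interleaved HWC image (uint8, 3 channels)
// into a planar 1 x 3 x H x W float blob, scaling every value by `scale`
// and optionally swapping the first and third channels (BGR &lt;-&gt; RGB).
std::vector&lt;float&gt; hwc_to_nchw(const std::vector&lt;std::uint8_t&gt;&amp; img,
                               std::size_t height, std::size_t width,
                               float scale, bool swap_rb)
{
    const std::size_t channels = 3;
    assert(img.size() == height * width * channels);

    std::vector&lt;float&gt; blob(img.size());
    for (std::size_t y = 0; y &lt; height; ++y) {
        for (std::size_t x = 0; x &lt; width; ++x) {
            for (std::size_t c = 0; c &lt; channels; ++c) {
                // Channel to read inside the interleaved source pixel.
                const std::size_t src_c = swap_rb ? (channels - 1 - c) : c;
                const std::size_t src = (y * width + x) * channels + src_c;
                // Planar destination: all of channel 0, then channel 1, then channel 2.
                const std::size_t dst = (c * height + y) * width + x;
                blob[dst] = static_cast&lt;float&gt;(img[src]) * scale;
            }
        }
    }
    return blob;
}
</code></pre></div></div>
<p>With <code class="language-plaintext highlighter-rouge">scale = 1.0f / 255</code> and <code class="language-plaintext highlighter-rouge">swap_rb = true</code>, this mirrors the scale factor and the <code class="language-plaintext highlighter-rouge">swapRB</code> argument used in the <code class="language-plaintext highlighter-rouge">blobFromImage</code> call above.</p>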

<h3 id="4-nmsnon-maximum-suppression-filtering">4. NMS(Non-maximum Suppression) filtering</h3>
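<p>Object detection inference generates lots of overlapping candidate boxes for the same object. NMS filters these candidates: it keeps the box with the highest confidence and discards the other boxes that overlap it too much. Here is what the NMS process looks like: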
<img src="/assets/images/AI-NMS_explanation.png" alt="Explanation of NMS" /></p>
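<p>OpenCV ships this step as <code class="language-plaintext highlighter-rouge">cv::dnn::NMSBoxes</code>. To show what happens inside, here is a plain C++ sketch of the greedy algorithm. The <code class="language-plaintext highlighter-rouge">Box</code> struct and the function names are invented for this example, and the code is a re-implementation of the general idea, not OpenCV’s source.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;algorithm&gt;
#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#include &lt;numeric&gt;
#include &lt;vector&gt;

// Hypothetical box type: top-left corner plus size.
struct Box { float x, y, width, height; };

// Intersection over union of two axis-aligned boxes.
float iou(const Box&amp; a, const Box&amp; b)
{
    const float x1 = std::max(a.x, b.x);
    const float y1 = std::max(a.y, b.y);
    const float x2 = std::min(a.x + a.width, b.x + b.width);
    const float y2 = std::min(a.y + a.height, b.y + b.height);
    const float inter = std::max(0.0f, x2 - x1) * std::max(0.0f, y2 - y1);
    const float uni = a.width * a.height + b.width * b.height - inter;
    return uni &gt; 0.0f ? inter / uni : 0.0f;
}

// Greedy NMS: returns the indices of the kept boxes, highest score first.
std::vector&lt;int&gt; nms(const std::vector&lt;Box&gt;&amp; boxes, const std::vector&lt;float&gt;&amp; scores,
                     float score_threshold, float iou_threshold)
{
    assert(boxes.size() == scores.size());

    // Candidate indices sorted by descending score.
    std::vector&lt;int&gt; order(boxes.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&amp;](int a, int b) { return scores[a] &gt; scores[b]; });

    std::vector&lt;int&gt; kept;
    for (int idx : order) {
        if (scores[idx] &lt; score_threshold)
            break;  // sorted order: every remaining candidate scores lower
        bool suppressed = false;
        for (int k : kept) {
            if (iou(boxes[idx], boxes[k]) &gt; iou_threshold) {
                suppressed = true;  // overlaps a better box too much
                break;
            }
        }
        if (!suppressed)
            kept.push_back(idx);
    }
    return kept;
}
</code></pre></div></div>
<p>The returned vector plays the same role as the <code class="language-plaintext highlighter-rouge">indices</code> array used in the next step: each entry points at a surviving box, its confidence and its class id.</p>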

<h3 id="5-for-each-detection-draw-a-bounding-box-on-frame">5. For each detection, draw a bounding box on frame</h3>
<p>The following code loops over all detections and draws a bounding box with classification name and confidence on the frame.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">indices</span><span class="p">.</span><span class="n">size</span><span class="p">();</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">idx</span> <span class="o">=</span> <span class="n">indices</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="n">cv</span><span class="o">::</span><span class="n">Rect2d</span> <span class="n">box</span> <span class="o">=</span> <span class="n">boxes</span><span class="p">[</span><span class="n">idx</span><span class="p">];</span>
    <span class="kt">float</span> <span class="n">conf</span> <span class="o">=</span> <span class="n">confidences</span><span class="p">[</span><span class="n">idx</span><span class="p">];</span>
    <span class="kt">int</span> <span class="n">class_id</span> <span class="o">=</span> <span class="n">classIds</span><span class="p">[</span><span class="n">idx</span><span class="p">];</span>
    <span class="n">add_bounding_box</span><span class="p">(</span><span class="n">class_names_</span><span class="p">,</span> <span class="n">class_id</span><span class="p">,</span> <span class="n">conf</span><span class="p">,</span> <span class="n">box</span><span class="p">.</span><span class="n">x</span><span class="p">,</span> <span class="n">box</span><span class="p">.</span><span class="n">y</span><span class="p">,</span> <span class="n">box</span><span class="p">.</span><span class="n">x</span> <span class="o">+</span> <span class="n">box</span><span class="p">.</span><span class="n">width</span><span class="p">,</span> <span class="n">box</span><span class="p">.</span><span class="n">y</span> <span class="o">+</span> <span class="n">box</span><span class="p">.</span><span class="n">height</span><span class="p">,</span> <span class="n">img</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We can use OpenCV to draw bounding boxes as well!</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">add_bounding_box</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&gt;</span> <span class="o">&amp;</span><span class="n">class_names</span><span class="p">,</span> <span class="kt">int</span> <span class="n">classId</span><span class="p">,</span> <span class="kt">float</span> <span class="n">conf</span><span class="p">,</span> <span class="kt">int</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span> <span class="n">top</span><span class="p">,</span> <span class="kt">int</span> <span class="n">right</span><span class="p">,</span> <span class="kt">int</span> <span class="n">bottom</span><span class="p">,</span> <span class="n">cv</span><span class="o">::</span><span class="n">Mat</span><span class="o">&amp;</span> <span class="n">frame</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">cv</span><span class="o">::</span><span class="n">rectangle</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">cv</span><span class="o">::</span><span class="n">Point</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">top</span><span class="p">),</span> <span class="n">cv</span><span class="o">::</span><span class="n">Point</span><span class="p">(</span><span class="n">right</span><span class="p">,</span> <span class="n">bottom</span><span class="p">),</span> <span class="n">cv</span><span class="o">::</span><span class="n">Scalar</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">255</span><span class="p">,</span> <span class="mi">0</span><span class="p">));</span>

    <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">label</span> <span class="o">=</span> <span class="n">cv</span><span class="o">::</span><span class="n">format</span><span class="p">(</span><span class="s">"%.2f"</span><span class="p">,</span> <span class="n">conf</span><span class="p">);</span>
    <span class="n">label</span> <span class="o">=</span> <span class="n">class_names</span><span class="p">[</span><span class="n">classId</span><span class="p">]</span> <span class="o">+</span> <span class="s">": "</span> <span class="o">+</span> <span class="n">label</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">baseLine</span><span class="p">;</span>
    <span class="n">cv</span><span class="o">::</span><span class="n">Size</span> <span class="n">labelSize</span> <span class="o">=</span> <span class="n">cv</span><span class="o">::</span><span class="n">getTextSize</span><span class="p">(</span><span class="n">label</span><span class="p">,</span> <span class="n">cv</span><span class="o">::</span><span class="n">FONT_HERSHEY_SIMPLEX</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">baseLine</span><span class="p">);</span>

    <span class="n">top</span> <span class="o">=</span> <span class="n">cv</span><span class="o">::</span><span class="n">max</span><span class="p">(</span><span class="n">top</span><span class="p">,</span> <span class="n">labelSize</span><span class="p">.</span><span class="n">height</span><span class="p">);</span>
    <span class="n">cv</span><span class="o">::</span><span class="n">rectangle</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">cv</span><span class="o">::</span><span class="n">Point</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">top</span> <span class="o">-</span> <span class="n">labelSize</span><span class="p">.</span><span class="n">height</span><span class="p">),</span>
            <span class="n">cv</span><span class="o">::</span><span class="n">Point</span><span class="p">(</span><span class="n">left</span> <span class="o">+</span> <span class="n">labelSize</span><span class="p">.</span><span class="n">width</span><span class="p">,</span> <span class="n">top</span> <span class="o">+</span> <span class="n">baseLine</span><span class="p">),</span> <span class="n">cv</span><span class="o">::</span><span class="n">Scalar</span><span class="o">::</span><span class="n">all</span><span class="p">(</span><span class="mi">255</span><span class="p">),</span> <span class="n">cv</span><span class="o">::</span><span class="n">FILLED</span><span class="p">);</span>
    <span class="n">cv</span><span class="o">::</span><span class="n">putText</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">label</span><span class="p">,</span> <span class="n">cv</span><span class="o">::</span><span class="n">Point</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">top</span><span class="p">),</span> <span class="n">cv</span><span class="o">::</span><span class="n">FONT_HERSHEY_SIMPLEX</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="n">cv</span><span class="o">::</span><span class="n">Scalar</span><span class="p">());</span>
<span class="p">}</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">cv::rectangle</code> and <code class="language-plaintext highlighter-rouge">cv::putText</code> are two useful functions that help us draw bounding boxes and labels.</p>

<h2 id="how-dnn-module-is-implemented-internally">How DNN module is implemented internally</h2>
<p>Let’s dive deep into OpenCV’s source code to see how the DNN module is implemented.</p>
<h3 id="net">Net</h3>
<p>This class represents an artificial neural network model.</p>
<h4 id="readnet">readNet</h4>
<p>This function loads a model file into a Net instance.</p>
<h4 id="forward">forward</h4>
<p>The actual inference is done by this function. Let’s see what it looks like:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//source file: https://github.com/opencv/opencv/blob/7ce518106ba041f3cbb27cafda9eb670e5bb99f3/modules/dnn/src/dnn.cpp#L3802</span>
<span class="kt">void</span> <span class="n">Net</span><span class="o">::</span><span class="n">forward</span><span class="p">(</span><span class="n">OutputArrayOfArrays</span> <span class="n">outputBlobs</span><span class="p">,</span> <span class="k">const</span> <span class="n">String</span><span class="o">&amp;</span> <span class="n">outputName</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">...</span>
    <span class="n">impl</span><span class="o">-&gt;</span><span class="n">setUpNet</span><span class="p">(</span><span class="n">pins</span><span class="p">);</span>
    <span class="n">impl</span><span class="o">-&gt;</span><span class="n">forwardToLayer</span><span class="p">(</span><span class="n">impl</span><span class="o">-&gt;</span><span class="n">getLayerData</span><span class="p">(</span><span class="n">layerName</span><span class="p">));</span>
    <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">setUpNet</code> prepares the DNN Net for computation.
The <code class="language-plaintext highlighter-rouge">forwardToLayer</code> function is backed by a backend-specific node, which does the concrete computation.</p>

<h3 id="layer">Layer</h3>
<p>This class represents an abstract neural network layer. It is the building block of a DNN Net.</p>

<h3 id="backendnode">BackendNode</h3>
<p>Each backend derives from this class to provide its own implementation of a Layer.</p>
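<p>The relationship between these classes can be pictured with a heavily simplified sketch. Every name below (<code class="language-plaintext highlighter-rouge">BackendNodeSketch</code>, <code class="language-plaintext highlighter-rouge">FakeCudaReluNode</code>, <code class="language-plaintext highlighter-rouge">LayerSketch</code>, <code class="language-plaintext highlighter-rouge">BackendId</code>) is invented for the example and does not match OpenCV’s real declarations. The only point is the pattern: a layer keeps one node per backend and delegates the computation to whichever node matches the selected backend.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#include &lt;map&gt;
#include &lt;memory&gt;
#include &lt;vector&gt;

// Simplified stand-ins; OpenCV's real classes carry much more state.
enum BackendId { BACKEND_DEFAULT, BACKEND_CUDA };

// Abstract interface that every backend-specific node implements.
struct BackendNodeSketch {
    virtual ~BackendNodeSketch() = default;
    virtual void forward(const std::vector&lt;float&gt;&amp; in, std::vector&lt;float&gt;&amp; out) = 0;
};

// Pretend "CUDA" node. It computes on the CPU so the sketch runs anywhere.
struct FakeCudaReluNode final : BackendNodeSketch {
    void forward(const std::vector&lt;float&gt;&amp; in, std::vector&lt;float&gt;&amp; out) override
    {
        out.resize(in.size());
        for (std::size_t i = 0; i &lt; in.size(); ++i)
            out[i] = in[i] &gt; 0.0f ? in[i] : 0.0f;
    }
};

// A layer owns one node per backend and forwards to the selected one.
struct LayerSketch {
    std::map&lt;BackendId, std::unique_ptr&lt;BackendNodeSketch&gt;&gt; nodes;

    void forward(BackendId backend, const std::vector&lt;float&gt;&amp; in, std::vector&lt;float&gt;&amp; out)
    {
        auto it = nodes.find(backend);
        assert(it != nodes.end());  // a node must exist for the chosen backend
        it-&gt;second-&gt;forward(in, out);
    }
};
</code></pre></div></div>
<p>Registering a <code class="language-plaintext highlighter-rouge">FakeCudaReluNode</code> under <code class="language-plaintext highlighter-rouge">BACKEND_CUDA</code> and calling <code class="language-plaintext highlighter-rouge">forward</code> with that backend runs the node’s computation, while the calling code never needs to know which backend is behind it.</p>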

<h2 id="how-nvidia-gpu-is-utilized-by-dnn-module">How NVIDIA GPU is utilized by DNN module</h2>
<p>As explained above, the cuda4dnn sub-module provides CUDA backend support for the DNN module.
It implements abstract interfaces such as BackendNode and wraps the concrete computation code inside.</p>

<h3 id="example-relu">Example: ReLU</h3>
<p>ReLU is a typical activation function, defined as $f(x) = \max(0, x)$.</p>
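<p>The cuda4dnn code in this section passes an extra <code class="language-plaintext highlighter-rouge">slope</code> argument to the kernel. Assuming it scales negative inputs, as in the “leaky” variant of ReLU, a CPU reference can be sketched as follows. The function names are made up for this example, and the formula for negative inputs is my reading of that parameter, not something taken from the excerpts shown here.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#include &lt;vector&gt;

// Scalar reference. With slope == 0 this is f(x) = max(0, x);
// a non-zero slope lets a scaled part of negative inputs through.
float relu_scalar(float x, float slope)
{
    return x &gt;= 0.0f ? x : slope * x;
}

// Element-wise version over a whole buffer, as a layer would apply it.
std::vector&lt;float&gt; relu_vector(const std::vector&lt;float&gt;&amp; input, float slope)
{
    std::vector&lt;float&gt; output(input.size());
    for (std::size_t i = 0; i &lt; input.size(); ++i)
        output[i] = relu_scalar(input[i], slope);
    assert(output.size() == input.size());
    return output;
}
</code></pre></div></div>
<p>Each output element depends only on the input element at the same index, which is why this operation maps so well onto a GPU: every element can be handled by a different thread.</p>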

<p>Let’s see how ReLU is implemented in cuda4dnn:</p>
<h4 id="reluop">ReLUOp</h4>
<p>ReLUOp is an implementation of CUDABackendNode.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//source file: opencv/modules/dnn/src/cuda4dnn/primitives/activation.hpp</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">&gt;</span>
<span class="k">class</span> <span class="nc">ReLUOp</span> <span class="k">final</span> <span class="o">:</span> <span class="k">public</span> <span class="n">CUDABackendNode</span> <span class="p">{</span>
<span class="nl">public:</span>
    <span class="p">...</span>
    <span class="kt">void</span> <span class="n">forward</span><span class="p">(</span>
        <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">cv</span><span class="o">::</span><span class="n">Ptr</span><span class="o">&lt;</span><span class="n">BackendWrapper</span><span class="o">&gt;&gt;&amp;</span> <span class="n">inputs</span><span class="p">,</span>
        <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">cv</span><span class="o">::</span><span class="n">Ptr</span><span class="o">&lt;</span><span class="n">BackendWrapper</span><span class="o">&gt;&gt;&amp;</span> <span class="n">outputs</span><span class="p">,</span>
        <span class="n">csl</span><span class="o">::</span><span class="n">Workspace</span><span class="o">&amp;</span> <span class="n">workspace</span><span class="p">)</span> <span class="k">override</span>
    <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">inputs</span><span class="p">.</span><span class="n">size</span><span class="p">();</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="p">{</span>
            <span class="c1">// input and output are derived from inputs[i] and outputs[i] (elided in this excerpt)</span>
            <span class="n">kernels</span><span class="o">::</span><span class="n">relu</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="p">(</span><span class="n">stream</span><span class="p">,</span> <span class="n">output</span><span class="p">,</span> <span class="n">input</span><span class="p">,</span> <span class="n">slope</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="p">...</span>
<span class="p">};</span>
</code></pre></div></div>
<p>When <code class="language-plaintext highlighter-rouge">forward</code> is called, the ReLUOp class delegates the computation to <code class="language-plaintext highlighter-rouge">kernels::relu</code>, which is defined as follows:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//source: https://github.com/opencv/opencv/blob/8808aaccffaec43d5d276af493ff408d81d4593c/modules/dnn/src/cuda/activations.cu#L126</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">&gt;</span>
<span class="kt">void</span> <span class="nf">relu</span><span class="p">(</span><span class="k">const</span> <span class="n">Stream</span><span class="o">&amp;</span> <span class="n">stream</span><span class="p">,</span> <span class="n">Span</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="n">output</span><span class="p">,</span> <span class="n">View</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="n">input</span><span class="p">,</span> <span class="n">T</span> <span class="n">slope</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">generic_op</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">relu_functor</span><span class="o">&gt;</span><span class="p">(</span><span class="n">stream</span><span class="p">,</span> <span class="n">output</span><span class="p">,</span> <span class="n">input</span><span class="p">,</span> <span class="n">slope</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<h4 id="cuda-kernel">CUDA kernel</h4>
<p>A CUDA kernel specialized with <code class="language-plaintext highlighter-rouge">relu_functor</code> is then instantiated from the following function template:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//source: https://github.com/opencv/opencv/blob/8808aaccffaec43d5d276af493ff408d81d4593c/modules/dnn/src/cuda/activations.cu#L30</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">,</span> <span class="k">class</span> <span class="nc">Functor</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="kt">size_t</span> <span class="n">N</span><span class="p">,</span> <span class="k">class</span> <span class="o">...</span><span class="nc">FunctorArgs</span><span class="p">&gt;</span>
<span class="n">__global__</span> <span class="kt">void</span> <span class="nf">generic_op_vec</span><span class="p">(</span><span class="n">Span</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="n">output</span><span class="p">,</span> <span class="n">View</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="n">input</span><span class="p">,</span> <span class="n">FunctorArgs</span> <span class="p">...</span><span class="n">functorArgs</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">using</span> <span class="n">vector_type</span> <span class="o">=</span> <span class="n">get_vector_type_t</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">N</span><span class="o">&gt;</span><span class="p">;</span>

    <span class="k">auto</span> <span class="n">output_vPtr</span> <span class="o">=</span> <span class="n">vector_type</span><span class="o">::</span><span class="n">get_pointer</span><span class="p">(</span><span class="n">output</span><span class="p">.</span><span class="n">data</span><span class="p">());</span>
    <span class="k">auto</span> <span class="n">input_vPtr</span> <span class="o">=</span> <span class="n">vector_type</span><span class="o">::</span><span class="n">get_pointer</span><span class="p">(</span><span class="n">input</span><span class="p">.</span><span class="n">data</span><span class="p">());</span>

    <span class="n">Functor</span> <span class="n">functor</span><span class="p">(</span><span class="n">functorArgs</span><span class="p">...);</span>

    <span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="n">i</span> <span class="o">:</span> <span class="n">grid_stride_range</span><span class="p">(</span><span class="n">output</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">/</span> <span class="n">vector_type</span><span class="o">::</span><span class="n">size</span><span class="p">()))</span> <span class="p">{</span>
        <span class="n">vector_type</span> <span class="n">vec</span><span class="p">;</span>
        <span class="n">v_load</span><span class="p">(</span><span class="n">vec</span><span class="p">,</span> <span class="n">input_vPtr</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">vector_type</span><span class="o">::</span><span class="n">size</span><span class="p">();</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span>
            <span class="n">vec</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">functor</span><span class="p">(</span><span class="n">vec</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">j</span><span class="p">]);</span>
        <span class="n">v_store</span><span class="p">(</span><span class="n">output_vPtr</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">vec</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
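<p>The loop in the kernel above relies on a grid-stride range. As a rough host-side illustration (plain C++, no CUDA), the sketch below lists the indices that one simulated thread would visit. It assumes that <code class="language-plaintext highlighter-rouge">grid_stride_range</code> starts at the thread’s global index and advances by the total number of threads in the grid. <code class="language-plaintext highlighter-rouge">grid_stride_indices</code> is a hypothetical helper written for this post, not part of cuda4dnn.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Indices visited by one simulated thread in a grid-stride loop (illustrative):
// start at the thread's global index and advance by the total thread count.
std::vector<std::size_t> grid_stride_indices(std::size_t thread_id,
                                             std::size_t total_threads,
                                             std::size_t n) {
    std::vector<std::size_t> indices;
    if (total_threads == 0)
        return indices;  // no threads, nothing to visit
    for (std::size_t i = thread_id; i < n; i += total_threads)
        indices.push_back(i);
    return indices;
}
```

<p>With 4 threads and 10 vector elements, thread 1 would handle indices 1, 5 and 9, and together the four threads cover every index exactly once.</p>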
<h4 id="launch-kernel">Launch kernel</h4>
<p>Finally, the kernel is launched by the following function:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//source: https://github.com/opencv/opencv/blob/8808aaccffaec43d5d276af493ff408d81d4593c/modules/dnn/src/cuda/execution.hpp#L64</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">Kernel</span><span class="p">,</span> <span class="k">typename</span> <span class="o">...</span><span class="nc">Args</span><span class="p">&gt;</span> <span class="kr">inline</span>
<span class="kt">void</span> <span class="nf">launch_kernel</span><span class="p">(</span><span class="n">Kernel</span> <span class="n">kernel</span><span class="p">,</span> <span class="n">Args</span> <span class="p">...</span><span class="n">args</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">auto</span> <span class="n">policy</span> <span class="o">=</span> <span class="n">make_policy</span><span class="p">(</span><span class="n">kernel</span><span class="p">);</span>
    <span class="n">kernel</span> <span class="o">&lt;&lt;&lt;</span><span class="n">policy</span><span class="p">.</span><span class="n">grid</span><span class="p">,</span> <span class="n">policy</span><span class="p">.</span><span class="n">block</span><span class="o">&gt;&gt;&gt;</span> <span class="p">(</span><span class="n">args</span><span class="p">...);</span>
<span class="p">}</span>
</code></pre></div></div>
<h4 id="relu-functor">ReLU Functor</h4>
<p>Let’s see what the core ReLU logic looks like in relu_functor:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//source: https://github.com/opencv/opencv/blob/d981d04c76821037f745a3684533e753d6951e21/modules/dnn/src/cuda/functors.hpp#L92</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">&gt;</span>
<span class="k">struct</span> <span class="nc">relu_functor</span> <span class="p">{</span>
    <span class="n">__device__</span> <span class="n">relu_functor</span><span class="p">(</span><span class="n">T</span> <span class="n">slope_</span><span class="p">)</span> <span class="o">:</span> <span class="n">slope</span><span class="p">{</span><span class="n">slope_</span><span class="p">}</span> <span class="p">{</span> <span class="p">}</span>
    <span class="n">__device__</span> <span class="n">T</span> <span class="nf">operator</span><span class="p">()(</span><span class="n">T</span> <span class="n">value</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">value</span> <span class="o">&gt;=</span> <span class="n">T</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="o">?</span> <span class="n">value</span> <span class="o">:</span> <span class="n">slope</span> <span class="o">*</span> <span class="n">value</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="n">T</span> <span class="n">slope</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
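<p>We can see that the cuda4dnn module actually implements the <strong>Leaky ReLU</strong> variant: a negative input is not clamped to zero but multiplied by <code class="language-plaintext highlighter-rouge">slope</code>, so part of the signal still “leaks” through. With a slope of zero this reduces to the standard ReLU.</p>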
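<p>To make the behaviour concrete, here is a minimal host-side sketch of the same rule in plain C++ (no CUDA). The names <code class="language-plaintext highlighter-rouge">leaky_relu</code> and <code class="language-plaintext highlighter-rouge">apply_leaky_relu</code> are hypothetical helpers for illustration and do not exist in OpenCV.</p>

```cpp
#include <cassert>
#include <vector>

// Host-side restatement of relu_functor's rule (illustrative, not OpenCV code):
// non-negative values pass through, negative values are scaled by `slope`.
template <class T>
T leaky_relu(T value, T slope) {
    return value >= T(0) ? value : slope * value;
}

// Hypothetical helper: apply the rule to every element of a vector,
// the way the kernel applies the functor to every element.
template <class T>
std::vector<T> apply_leaky_relu(const std::vector<T>& input, T slope) {
    std::vector<T> output;
    output.reserve(input.size());
    for (const T& v : input)
        output.push_back(leaky_relu(v, slope));
    return output;
}
```

<p>For example, with a slope of 0.25 the inputs -4, 0 and 2 map to -1, 0 and 2.</p>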

<h2 id="further-reading">Further Reading</h2>
<ol>
  <li><a href="https://github.com/opencv/opencv/blob/master/samples/dnn/object_detection.cpp">Sample source code</a></li>
</ol>]]></content><author><name></name></author><category term="ai" /><category term="computer-vision" /><category term="gpu" /><category term="C/C++" /><summary type="html"><![CDATA[Process walk-through and mechanism analysis of OpenCV DNN module.]]></summary></entry><entry><title type="html">Exploration of Linux cgroups</title><link href="https://blog.labxq.com/os/virtualization/2020/08/27/linux-cgroups-exploration.html" rel="alternate" type="text/html" title="Exploration of Linux cgroups" /><published>2020-08-27T00:00:00-07:00</published><updated>2020-08-27T00:00:00-07:00</updated><id>https://blog.labxq.com/os/virtualization/2020/08/27/linux-cgroups-exploration</id><content type="html" xml:base="https://blog.labxq.com/os/virtualization/2020/08/27/linux-cgroups-exploration.html"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>cgroups (control groups) is a Linux kernel feature that limits, accounts for and isolates the resource usage (e.g. CPU, memory, disk I/O, network) of groups of processes.</p>

<p>It is a cornerstone of today’s most popular containerization and orchestration technologies, including Docker and Kubernetes.</p>

<h2 id="background">Background</h2>
<p>The essence of any virtualization technique is the isolation and management of some resource, and cgroups is no exception.</p>

<p>Just as the process abstraction manages and isolates the execution of machine code, a cgroup manages and isolates a group of processes.</p>

<h2 id="real-world-example">Real-world example</h2>
<p>I once worked on an old service platform. The whole stack had been built long before containerization became popular, and each service ran as a plain executable inside a VM.</p>

<p>At one point I found that the service was constantly triggering low-memory alerts. After some investigation, it turned out that a network issue was producing a flood of error logs, and the embedded fluentd log agent ate up all the RAM while processing them.</p>

<p>This system was quite fragile: even the logging process could bring down the whole service. Had the log agent run in a cgroup-based container with an appropriate resource limit, the problem would have been confined to the log container and could not have affected the service’s main logic.</p>

<h2 id="play-around-linux-command">Play around with Linux commands</h2>
<p>Let’s play with a few cgroup-related commands to get a more intuitive impression.</p>
<h3 id="find-cgroup-of-a-process">Find cgroup of a process</h3>
<p>A cgroup is, in essence, an attribute of a process, so a process’s cgroup membership can be found in its <code class="language-plaintext highlighter-rouge">/proc/PID/</code> directory.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cat</span> /proc/self/cgroup
</code></pre></div></div>
<p>Sample output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2:blkio:/init.scope
1:name=systemd:/init.scope
0::/init.scope
</code></pre></div></div>
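<p>Each line of that file has the form <code class="language-plaintext highlighter-rouge">hierarchy-ID:controller-list:cgroup-path</code>. For the cgroup v2 unified hierarchy the ID is 0 and the controller list is empty. The following is a minimal parsing sketch; the struct <code class="language-plaintext highlighter-rouge">CgroupEntry</code> and the function <code class="language-plaintext highlighter-rouge">parse_cgroup_line</code> are names made up for this example.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <string>

// One parsed line of /proc/PID/cgroup (field names chosen for this sketch).
struct CgroupEntry {
    int hierarchy_id;         // 0 for the cgroup v2 unified hierarchy
    std::string controllers;  // comma-separated list, empty for cgroup v2
    std::string path;         // cgroup path relative to the hierarchy root
};

// Parse "hierarchy-ID:controller-list:cgroup-path".
// Returns false if the two ':' separators or a numeric ID are missing.
bool parse_cgroup_line(const std::string& line, CgroupEntry& out) {
    const std::size_t first = line.find(':');
    if (first == std::string::npos) return false;
    const std::size_t second = line.find(':', first + 1);
    if (second == std::string::npos) return false;
    try {
        out.hierarchy_id = std::stoi(line.substr(0, first));
    } catch (const std::exception&) {
        return false;  // the hierarchy ID was not a number
    }
    out.controllers = line.substr(first + 1, second - first - 1);
    out.path = line.substr(second + 1);
    return true;
}
```

<p>Applied to the sample output above, the line <code class="language-plaintext highlighter-rouge">0::/init.scope</code> yields ID 0, an empty controller list and the path <code class="language-plaintext highlighter-rouge">/init.scope</code>.</p>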
<h3 id="list-all-cgroups">List all cgroups</h3>
<p>Use <code class="language-plaintext highlighter-rouge">systemctl status</code> to see the cgroup hierarchy. The output looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CGroup: /
        ├─user.slice 
        │ └─user-1000.slice 
        │   ├─user@1000.service 
        │   │ ├─gnome-shell-wayland.service 
        │   │ │ ├─ 1129 /usr/bin/gnome-shell
        │   │ ├─gnome-terminal-server.service 
        │   
        ├─init.scope 
        │ └─1 /sbin/init
        └─system.slice 
            ├─systemd-udevd.service 
            │ └─285 /usr/lib/systemd/systemd-udevd
            ├─systemd-journald.service 
            │ └─272 /usr/lib/systemd/systemd-journald
            ├─NetworkManager.service 
            │ └─656 /usr/bin/NetworkManager --no-daemon
</code></pre></div></div>
<p>We can see that there are three top-level categories: init, system and user.</p>
<h3 id="find-resource-usage-of-cgroup">Find resource usage of cgroup</h3>
<p>Use <code class="language-plaintext highlighter-rouge">systemd-cgtop</code> to see the resource usage of each cgroup.
Sample output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Control Group                 Tasks   %CPU   Memory  Input/s Output/s
 /                            2031   76.8    17.7G        -        -
user.slice                    1660   64.1    14.7G        -        -
system.slice                   196    3.4     2.4G        -        -

</code></pre></div></div>

<h2 id="how-cgroup-works-internally">How cgroup works internally</h2>
<h3 id="file-based-design">File-based design</h3>
<p>Like most Linux components, cgroups follows the famous Unix rule: <strong>everything is a file</strong>.
Linux exposes cgroups through a filesystem mounted at <code class="language-plaintext highlighter-rouge">/sys/fs/cgroup</code>. The cgroup hierarchy mirrors the directory structure of this filesystem.</p>
<h3 id="controller">Controller</h3>
<p>In cgroup v1, each top-level directory in the cgroup filesystem corresponds to a <strong>controller</strong> (also called a subsystem). For example, the cpu controller is exposed at <code class="language-plaintext highlighter-rouge">/sys/fs/cgroup/cpu</code>.
To use a controller, mount a cgroup filesystem with that controller enabled on its directory:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mount <span class="nt">-t</span> cgroup <span class="nt">-o</span> cpu none /sys/fs/cgroup/cpu
</code></pre></div></div>
<h4 id="common-controllers">Common controllers</h4>
<h5 id="cpu">cpu</h5>
<p>This controller limits the CPU time that the processes in a cgroup can use.</p>
<h5 id="cpuacct">cpuacct</h5>
<p>This controller accounts for the CPU usage of the processes in a cgroup.</p>
<h5 id="memory">memory</h5>
<p>This controller limits the memory used by the processes in a cgroup.</p>

<h3 id="move-a-process-to-cgroup">Move a process to cgroup</h3>
<p>The PIDs of all processes in a cgroup are listed in its <code class="language-plaintext highlighter-rouge">cgroup.procs</code> file. To move a process into a cgroup, write its PID to this file. The following command moves the current shell:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="nv">$$</span> <span class="o">&gt;</span> /sys/fs/cgroup/cpu/cg1/cgroup.procs
</code></pre></div></div>
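<p>The same step can be done from a program, since it is just a file write. Below is a minimal sketch; <code class="language-plaintext highlighter-rouge">add_pid_to_cgroup</code> is a hypothetical helper, and the cgroup directory passed to it is assumed to exist already and to be writable by the caller (usually root).</p>

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: move process `pid` into the cgroup whose directory is
// `cgroup_dir` by writing the PID to that directory's cgroup.procs file.
// Returns true if the write succeeded.
bool add_pid_to_cgroup(const std::string& cgroup_dir, long pid) {
    std::ofstream procs(cgroup_dir + "/cgroup.procs");
    if (!procs.is_open())
        return false;  // directory missing or not writable
    procs << pid << '\n';
    procs.flush();     // push the write out so that errors surface here
    return procs.good();
}
```

<p>A call such as <code class="language-plaintext highlighter-rouge">add_pid_to_cgroup("/sys/fs/cgroup/cpu/cg1", pid)</code> would then mirror the shell command above.</p>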

<h2 id="kernel-code-analysis">Kernel code analysis</h2>
<p>We cannot fully understand cgroups without reading the source code. Let’s dive into the Linux kernel source to see how cgroups are implemented.</p>
<h3 id="source-file">Source file</h3>
<p>The cgroup source code is located at: <a href="https://code.woboq.org/linux/linux/kernel/cgroup/">linux/linux/kernel/cgroup/</a></p>

<h3 id="overview">Overview</h3>
<p>The kernel basically needs to do two things for cgroups:</p>
<ol>
  <li>At system boot, call cgroup_init() from linux/init/main.c to initialize the root cgroups;</li>
  <li>For each process created, make sure it is assigned to the appropriate cgroups.</li>
</ol>

<h3 id="data-structure">Data structures</h3>
<h4 id="cgroup">cgroup</h4>
<p>This structure is the kernel’s in-memory representation of a cgroup, i.e. of one directory in the cgroup filesystem.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">cgroup</span> <span class="p">{</span>
    <span class="p">...</span>
    <span class="kt">int</span> <span class="n">level</span><span class="p">;</span>
    <span class="cm">/* Maximum allowed descent tree depth */</span>
    <span class="kt">int</span> <span class="n">max_depth</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">nr_descendants</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">nr_dying_descendants</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">max_descendants</span><span class="p">;</span>

    <span class="k">struct</span> <span class="nc">kernfs_node</span> <span class="o">*</span><span class="n">kn</span><span class="p">;</span>		<span class="cm">/* cgroup kernfs entry */</span>
    <span class="k">struct</span> <span class="nc">cgroup_file</span> <span class="n">procs_file</span><span class="p">;</span>	<span class="cm">/* handle for "cgroup.procs" */</span>
    <span class="k">struct</span> <span class="nc">cgroup_file</span> <span class="n">events_file</span><span class="p">;</span>	<span class="cm">/* handle for "cgroup.events" */</span>
    <span class="p">...</span>
<span class="p">};</span>
</code></pre></div></div>

<h4 id="cgroup_subsyscss">cgroup_subsys</h4>
<p>This is one of the core data structures of the cgroup implementation. It represents a specific controller, and most of its callbacks operate on a cgroup_subsys_state (css), the per-cgroup state of that controller.<br />
The following is a simplified definition of the cgroup_subsys struct (some details removed):</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">cgroup_subsys</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="nc">cgroup_subsys_state</span> <span class="o">*</span><span class="p">(</span><span class="o">*</span><span class="n">css_alloc</span><span class="p">)(</span><span class="k">struct</span> <span class="nc">cgroup_subsys_state</span> <span class="o">*</span><span class="n">parent_css</span><span class="p">);</span>
	<span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">css_online</span><span class="p">)(</span><span class="k">struct</span> <span class="nc">cgroup_subsys_state</span> <span class="o">*</span><span class="n">css</span><span class="p">);</span>
	<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">css_offline</span><span class="p">)(</span><span class="k">struct</span> <span class="nc">cgroup_subsys_state</span> <span class="o">*</span><span class="n">css</span><span class="p">);</span>
	<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">css_released</span><span class="p">)(</span><span class="k">struct</span> <span class="nc">cgroup_subsys_state</span> <span class="o">*</span><span class="n">css</span><span class="p">);</span>
	<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">css_free</span><span class="p">)(</span><span class="k">struct</span> <span class="nc">cgroup_subsys_state</span> <span class="o">*</span><span class="n">css</span><span class="p">);</span>
	<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">css_reset</span><span class="p">)(</span><span class="k">struct</span> <span class="nc">cgroup_subsys_state</span> <span class="o">*</span><span class="n">css</span><span class="p">);</span>
	<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">fork</span><span class="p">)(</span><span class="k">struct</span> <span class="nc">task_struct</span> <span class="o">*</span><span class="n">task</span><span class="p">);</span>
	<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">exit</span><span class="p">)(</span><span class="k">struct</span> <span class="nc">task_struct</span> <span class="o">*</span><span class="n">task</span><span class="p">);</span>
	<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">release</span><span class="p">)(</span><span class="k">struct</span> <span class="nc">task_struct</span> <span class="o">*</span><span class="n">task</span><span class="p">);</span>
	<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">bind</span><span class="p">)(</span><span class="k">struct</span> <span class="nc">cgroup_subsys_state</span> <span class="o">*</span><span class="n">root_css</span><span class="p">);</span>
<span class="p">};</span>
</code></pre></div></div>
<p>For our purposes, the most important callback is <code class="language-plaintext highlighter-rouge">fork</code>, which the kernel invokes for every newly created task so that the controller can apply its settings to it.</p>

<h3 id="example-cpuset">Example: cpuset</h3>
<p>cpuset is a typical cgroup controller. It controls the processor placement (a.k.a. <a href="https://en.wikipedia.org/wiki/Processor_affinity">processor affinity</a>) and memory node placement of processes.
A typical flow of forking a new process in a cpuset cgroup looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>_do_fork (linux/kernel/fork.c)
↓
copy_process
↓
cgroup_fork, then cgroup_post_fork
↓
cpuset_fork
</code></pre></div></div>

<h4 id="_do_fork">_do_fork</h4>
<p>_do_fork is the backend of the <code class="language-plaintext highlighter-rouge">fork</code> system call.
Its main job is to copy the current process into a new process.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">_do_fork</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">clone_flags</span><span class="p">,</span>
	      <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">stack_start</span><span class="p">,</span>
	      <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">stack_size</span><span class="p">,</span>
	      <span class="kt">int</span> <span class="n">__user</span> <span class="o">*</span><span class="n">parent_tidptr</span><span class="p">,</span>
	      <span class="kt">int</span> <span class="n">__user</span> <span class="o">*</span><span class="n">child_tidptr</span><span class="p">,</span>
	      <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">tls</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">...</span>
    <span class="n">p</span> <span class="o">=</span> <span class="n">copy_process</span><span class="p">(</span><span class="n">clone_flags</span><span class="p">,</span> <span class="n">stack_start</span><span class="p">,</span> <span class="n">stack_size</span><span class="p">,</span>
			 <span class="n">child_tidptr</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">trace</span><span class="p">,</span> <span class="n">tls</span><span class="p">,</span> <span class="n">NUMA_NO_NODE</span><span class="p">);</span>
    <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">copy_process</code> first performs various setup steps on the new task, including the cgroup setup, and then initializes its scheduler state:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">__latent_entropy</span> <span class="k">struct</span> <span class="nc">task_struct</span> <span class="o">*</span><span class="nf">copy_process</span><span class="p">(</span>
					<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">clone_flags</span><span class="p">,</span>
					<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">stack_start</span><span class="p">,</span>
					<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">stack_size</span><span class="p">,</span>
					<span class="kt">int</span> <span class="n">__user</span> <span class="o">*</span><span class="n">child_tidptr</span><span class="p">,</span>
					<span class="k">struct</span> <span class="nc">pid</span> <span class="o">*</span><span class="n">pid</span><span class="p">,</span>
					<span class="kt">int</span> <span class="n">trace</span><span class="p">,</span>
					<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">tls</span><span class="p">,</span>
					<span class="kt">int</span> <span class="n">node</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">...</span>
    <span class="n">cgroup_fork</span><span class="p">(</span><span class="n">p</span><span class="p">);</span>
    <span class="p">...</span>
    <span class="n">retval</span> <span class="o">=</span> <span class="n">sched_fork</span><span class="p">(</span><span class="n">clone_flags</span><span class="p">,</span> <span class="n">p</span><span class="p">);</span>
    <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<h4 id="cgroup_fork">cgroup_fork</h4>
<p><code class="language-plaintext highlighter-rouge">cgroup_fork</code> only initializes the cgroup-related fields of the new task. The main cgroup logic lives in <code class="language-plaintext highlighter-rouge">cgroup_post_fork</code>, which <code class="language-plaintext highlighter-rouge">copy_process</code> calls later:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">cgroup_post_fork</span><span class="p">(</span><span class="k">struct</span> <span class="nc">task_struct</span> <span class="o">*</span><span class="n">child</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="nc">cgroup_subsys</span> <span class="o">*</span><span class="n">ss</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
    <span class="n">do_each_subsys_mask</span><span class="p">(</span><span class="n">ss</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">have_fork_callback</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">ss</span><span class="o">-&gt;</span><span class="n">fork</span><span class="p">(</span><span class="n">child</span><span class="p">);</span>
    <span class="p">}</span> <span class="n">while_each_subsys_mask</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>As the code shows, the fork callback of every subsystem (controller) that provides one is called. cpuset is one such subsystem. It defines its own cgroup_subsys instance with the relevant callbacks filled in:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">cgroup_subsys</span> <span class="n">cpuset_cgrp_subsys</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">...</span>
    <span class="p">.</span><span class="n">css_alloc</span>	<span class="o">=</span> <span class="n">cpuset_css_alloc</span><span class="p">,</span>
    <span class="p">.</span><span class="n">css_free</span>	<span class="o">=</span> <span class="n">cpuset_css_free</span><span class="p">,</span>
    <span class="p">.</span><span class="n">fork</span>		<span class="o">=</span> <span class="n">cpuset_fork</span><span class="p">,</span>
    <span class="p">...</span>
<span class="p">};</span>
</code></pre></div></div>
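<p>The dispatch pattern used here, a table of callbacks plus a bitmask that records which subsystems provide a fork callback, can be sketched in user space as follows. This is a simplified model and not kernel code: <code class="language-plaintext highlighter-rouge">Task</code>, <code class="language-plaintext highlighter-rouge">Subsys</code>, <code class="language-plaintext highlighter-rouge">build_fork_mask</code>, <code class="language-plaintext highlighter-rouge">post_fork</code> and the hook functions are all names invented for this example, and the way the mask is built is an assumption about how <code class="language-plaintext highlighter-rouge">have_fork_callback</code> is meant to be used.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// User-space stand-ins for kernel types; every name here is illustrative.
struct Task {
    std::vector<std::string> log;  // records which fork hooks ran
};

struct Subsys {
    const char* name;
    void (*fork)(Task* task);  // null when the subsystem has no fork callback
};

void cpuset_fork_hook(Task* task) { task->log.push_back("cpuset"); }
void freezer_fork_hook(Task* task) { task->log.push_back("freezer"); }

// Build a bitmask with bit i set when subsystem i provides a fork callback.
unsigned long build_fork_mask(const std::vector<Subsys>& subsystems) {
    unsigned long mask = 0;
    for (std::size_t i = 0; i < subsystems.size(); ++i)
        if (subsystems[i].fork != nullptr)
            mask |= 1UL << i;
    return mask;
}

// Call fork() only for subsystems whose bit is set and skip the others.
void post_fork(const std::vector<Subsys>& subsystems, unsigned long mask, Task* child) {
    for (std::size_t i = 0; i < subsystems.size(); ++i)
        if (mask & (1UL << i))
            subsystems[i].fork(child);
}
```

<p>The mask is computed once, so the per-fork path does not have to test every callback pointer for null.</p>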

<h4 id="cpuset_fork">cpuset_fork</h4>
<p>The cpuset_fork implementation is as follows:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//source file: linux/kernel/cgroup/cpuset.c</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">cpuset_fork</span><span class="p">(</span><span class="k">struct</span> <span class="nc">task_struct</span> <span class="o">*</span><span class="n">task</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">task_css_is_root</span><span class="p">(</span><span class="n">task</span><span class="p">,</span> <span class="n">cpuset_cgrp_id</span><span class="p">))</span>
		<span class="k">return</span><span class="p">;</span>
	<span class="n">set_cpus_allowed_ptr</span><span class="p">(</span><span class="n">task</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">current</span><span class="o">-&gt;</span><span class="n">cpus_allowed</span><span class="p">);</span>
	<span class="n">task</span><span class="o">-&gt;</span><span class="n">mems_allowed</span> <span class="o">=</span> <span class="n">current</span><span class="o">-&gt;</span><span class="n">mems_allowed</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Finally, we reach our destination: <code class="language-plaintext highlighter-rouge">set_cpus_allowed_ptr</code>, a thin wrapper around <code class="language-plaintext highlighter-rouge">__set_cpus_allowed_ptr</code>. This is the core of what cpuset is supposed to do: change a process’s CPU affinity.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//source file: linux/kernel/sched/core.c</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">__set_cpus_allowed_ptr</span><span class="p">(</span><span class="k">struct</span> <span class="nc">task_struct</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span>
				  <span class="k">const</span> <span class="k">struct</span> <span class="nc">cpumask</span> <span class="o">*</span><span class="n">new_mask</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">check</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">const</span> <span class="k">struct</span> <span class="nc">cpumask</span> <span class="o">*</span><span class="n">cpu_valid_mask</span> <span class="o">=</span> <span class="n">cpu_active_mask</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">dest_cpu</span><span class="p">;</span>
    <span class="n">dest_cpu</span> <span class="o">=</span> <span class="n">cpumask_any_and</span><span class="p">(</span><span class="n">cpu_valid_mask</span><span class="p">,</span> <span class="n">new_mask</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">task_running</span><span class="p">(</span><span class="n">rq</span><span class="p">,</span> <span class="n">p</span><span class="p">)</span> <span class="o">||</span> <span class="n">p</span><span class="o">-&gt;</span><span class="n">state</span> <span class="o">==</span> <span class="n">TASK_WAKING</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">struct</span> <span class="nc">migration_arg</span> <span class="n">arg</span> <span class="o">=</span> <span class="p">{</span> <span class="n">p</span><span class="p">,</span> <span class="n">dest_cpu</span> <span class="p">};</span>
        <span class="cm">/* Need help from migration thread: drop lock and wait. */</span>
        <span class="n">task_rq_unlock</span><span class="p">(</span><span class="n">rq</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">rf</span><span class="p">);</span>
        <span class="n">stop_one_cpu</span><span class="p">(</span><span class="n">cpu_of</span><span class="p">(</span><span class="n">rq</span><span class="p">),</span> <span class="n">migration_cpu_stop</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">arg</span><span class="p">);</span>
        <span class="n">tlb_migrate_finish</span><span class="p">(</span><span class="n">p</span><span class="o">-&gt;</span><span class="n">mm</span><span class="p">);</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="nf">if</span> <span class="p">(</span><span class="n">task_on_rq_queued</span><span class="p">(</span><span class="n">p</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">rq</span> <span class="o">=</span> <span class="n">move_queued_task</span><span class="p">(</span><span class="n">rq</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">rf</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">dest_cpu</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
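<p>In short, the code above picks dest_cpu from the CPUs that are both active and allowed by the new mask. If the task is currently running or waking up, the kernel asks the stopper thread on the task’s current CPU to migrate it. If the task is merely queued, it is moved to the destination CPU’s run queue directly.</p>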
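<p>The CPU selection step can be illustrated with plain bitmasks. The sketch below models each mask as a 64-bit integer in which bit i stands for CPU i. This is an assumption made for the example: the kernel uses <code class="language-plaintext highlighter-rouge">struct cpumask</code> and has its own convention for reporting an empty intersection. <code class="language-plaintext highlighter-rouge">pick_dest_cpu</code> and <code class="language-plaintext highlighter-rouge">kNoCpu</code> are hypothetical names.</p>

```cpp
#include <cassert>
#include <cstdint>

// Returned when the two masks share no CPU (a sentinel chosen for this sketch).
constexpr int kNoCpu = -1;

// Pick a CPU that is set in both masks; bit i represents CPU i.
// This sketch always returns the lowest common CPU.
int pick_dest_cpu(std::uint64_t active_mask, std::uint64_t allowed_mask) {
    const std::uint64_t candidates = active_mask & allowed_mask;
    for (int cpu = 0; cpu < 64; ++cpu)
        if (candidates & (std::uint64_t{1} << cpu))
            return cpu;
    return kNoCpu;
}
```

<p>With CPUs 0 to 3 active (mask 0xF) and CPUs 2 and 3 allowed (mask 0xC), the function returns CPU 2.</p>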

<h3 id="comments-are-welcomed">Comments are welcome!</h3>
<p>There are tons of code in the Linux repository, so the analysis above is only an overview and may be inaccurate in places. Please leave a comment if you find an error.</p>

<h2 id="further-reading">Further Reading</h2>
<p>If you would like to learn more, see:</p>
<ol>
  <li>https://wiki.archlinux.org/index.php/cgroups</li>
  <li>https://man7.org/linux/man-pages/man7/cgroups.7.html</li>
  <li>https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt</li>
  <li>https://www.kernel.org/doc/Documentation/cgroup-v2.txt</li>
</ol>]]></content><author><name></name></author><category term="os" /><category term="virtualization" /><category term="linux" /><category term="docker" /><category term="kubernetes" /><summary type="html"><![CDATA[Introduction, usage, mechanism and analysis of Linux cgroups.]]></summary></entry></feed>