Exploration of Linux cgroups
Introduction
cgroups is a Linux kernel feature that isolates and limits computer resources (e.g. CPU, memory, disk I/O, network).
It is the cornerstone of today's hottest containerization/orchestration technologies, including Docker and Kubernetes.
Background
The essence of any virtualization technique is isolation and management of some resource, and cgroups are no exception.
Just as the process abstraction implements management/isolation of machine-code execution, cgroups implement management/isolation of groups of processes.
Real-world example
I previously worked on an old service platform. The whole stack was developed before containerization became mainstream, so each service ran as a plain executable inside a VM.
At one point I found that low-memory alerts were being triggered constantly by the service. It turned out that a network issue had produced a flood of error logs, and the embedded fluentd log agent ate up all the RAM processing them.
This shows how fragile the system was: even the logging process could crash the whole service. With cgroup-based containers and appropriate resource limits, the problem would have been confined to the log container, preventing a logging issue from affecting the service's main logic.
Play around Linux command
TL;DR: let's play with the cgroup commands to get a more intuitive impression.
Find cgroup of a process
A cgroup is, in essence, an attribute of a process, so cgroup info can be found under the /proc/PID/ directory.
cat /proc/self/cgroup
Sample output:
2:blkio:/init.scope
1:name=systemd:/init.scope
0::/init.scope
List all cgroups
Use systemctl status
to get the cgroup hierarchy. The output looks like:
CGroup: /
├─user.slice
│ └─user-1000.slice
│ ├─user@1000.service
│ │ ├─gnome-shell-wayland.service
│ │ │ ├─ 1129 /usr/bin/gnome-shell
│ │ ├─gnome-terminal-server.service
│
├─init.scope
│ └─1 /sbin/init
└─system.slice
├─systemd-udevd.service
│ └─285 /usr/lib/systemd/systemd-udevd
├─systemd-journald.service
│ └─272 /usr/lib/systemd/systemd-journald
├─NetworkManager.service
│ └─656 /usr/bin/NetworkManager --no-daemon
We can see that there are three top-level categories: init.scope, user.slice and system.slice.
Find resource usage of cgroup
Use systemd-cgtop
to see the resource usage of each group.
Sample output:
Control Group Tasks %CPU Memory Input/s Output/s
/ 2031 76.8 17.7G - -
user.slice 1660 64.1 14.7G - -
system.slice 196 3.4 2.4G - -
How cgroup works internally
File-based design
Like most Linux components, cgroups follow the famous Unix rule: everything is a file.
Linux mounts a filesystem at /sys/fs/cgroup
to represent cgroups. The cgroup hierarchy mirrors the structure of this directory.
Controller
Each top-level folder in the cgroup filesystem corresponds to a controller, e.g. the cpu controller is just the folder /sys/fs/cgroup/cpu
.
If you would like to use a controller, just mount it into the cgroup filesystem:
mount -t cgroup -o cpu none /sys/fs/cgroup/cpu
Common controllers
cpu
This controller limits the CPU time processes in the group can use.
cpuacct
This controller provides CPU usage accounting/statistics for the group.
memory
This controller limits and accounts for the memory used by processes in the group.
Move a process to cgroup
All processes in a cgroup are listed in its cgroup.procs
file. Just write a PID to this file to add a process to the cgroup:
echo $$ > /sys/fs/cgroup/cpu/cg1/cgroup.procs
Kernel code analysis
We cannot fully understand cgroups without reading the source code directly. Let's dive into the Linux kernel source to see how cgroup is implemented.
source file
The cgroup source code is located at: linux/kernel/cgroup/
Overview
The kernel basically needs to do two things for cgroups:
- In linux/init/main.c, call cgroup_init() to read/initialize the root cgroups at system boot;
- For each process created, make sure it is assigned to the appropriate cgroups.
data structure
cgroup
Info in cgroup filesystem is loaded into this data structure.
struct cgroup {
...
int level;
/* Maximum allowed descent tree depth */
int max_depth;
int nr_descendants;
int nr_dying_descendants;
int max_descendants;
struct kernfs_node *kn; /* cgroup kernfs entry */
struct cgroup_file procs_file; /* handle for "cgroup.procs" */
struct cgroup_file events_file; /* handle for "cgroup.events" */
...
};
cgroup_subsys
This is one of the core data structures of the cgroup implementation. It represents a specific controller (subsystem); its per-cgroup instance is a cgroup_subsys_state, usually abbreviated css.
Following is a simplified cgroup_subsys struct definition (some detailed code removed):
struct cgroup_subsys {
struct cgroup_subsys_state *(*css_alloc)(struct cgroup_subsys_state *parent_css);
int (*css_online)(struct cgroup_subsys_state *css);
void (*css_offline)(struct cgroup_subsys_state *css);
void (*css_released)(struct cgroup_subsys_state *css);
void (*css_free)(struct cgroup_subsys_state *css);
void (*css_reset)(struct cgroup_subsys_state *css);
void (*fork)(struct task_struct *task);
void (*exit)(struct task_struct *task);
void (*release)(struct task_struct *task);
void (*bind)(struct cgroup_subsys_state *root_css);
};
The most important function here is fork
, which the kernel uses to apply each controller's settings to a newly created process.
Example: cpuset
cpuset is a typical cgroup controller. It controls the processor placement (a.k.a. processor affinity) of processes. A typical flow of forking a new process under a cpuset cgroup looks like:
_do_fork(linux/kernel/fork.c)
↓
cgroup_fork
↓
cpuset_fork
_do_fork
_do_fork is the backend of the fork
system call.
Its main logic is copying the content of the current process into a new process.
long _do_fork(unsigned long clone_flags,
unsigned long stack_start,
unsigned long stack_size,
int __user *parent_tidptr,
int __user *child_tidptr,
unsigned long tls)
{
...
p = copy_process(clone_flags, stack_start, stack_size,
child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
...
}
In copy_process
, basically we do some configuration first, then schedule the forked process.
static __latent_entropy struct task_struct *copy_process(
unsigned long clone_flags,
unsigned long stack_start,
unsigned long stack_size,
int __user *child_tidptr,
struct pid *pid,
int trace,
unsigned long tls,
int node)
{
...
cgroup_fork(p);
...
retval = sched_fork(clone_flags, p);
...
}
cgroup_fork
cgroup_fork
initializes the cgroup data structures; the main cgroup logic is inside cgroup_post_fork
:
void cgroup_post_fork(struct task_struct *child)
{
	struct cgroup_subsys *ss;
	int i;

	do_each_subsys_mask(ss, i, have_fork_callback) {
		ss->fork(child);
	} while_each_subsys_mask();
}
As the code indicates, the fork function of every subsystem (controller) that registers one will be called. cpuset is one such subsystem, with its own cgroup_subsys instance that implements these callbacks:
struct cgroup_subsys cpuset_cgrp_subsys = {
...
.css_alloc = cpuset_css_alloc,
.css_free = cpuset_css_free,
.fork = cpuset_fork,
...
};
cpuset_fork
The cpuset_fork implementation is as follows:
//source file: linux/kernel/cgroup/cpuset.c
static void cpuset_fork(struct task_struct *task)
{
if (task_css_is_root(task, cpuset_cgrp_id))
return;
set_cpus_allowed_ptr(task, &current->cpus_allowed);
task->mems_allowed = current->mems_allowed;
}
Finally, we reach our destination: set_cpus_allowed_ptr
. This is the core of what cpuset is supposed to do: change a process's CPU affinity.
//source file: linux/kernel/sched/core.c (simplified; locking and some declarations omitted)
static int __set_cpus_allowed_ptr(struct task_struct *p,
const struct cpumask *new_mask, bool check)
{
const struct cpumask *cpu_valid_mask = cpu_active_mask;
unsigned int dest_cpu;
dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
if (task_running(rq, p) || p->state == TASK_WAKING) {
struct migration_arg arg = { p, dest_cpu };
/* Need help from migration thread: drop lock and wait. */
task_rq_unlock(rq, p, &rf);
stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
tlb_migrate_finish(p->mm);
return 0;
} else if (task_on_rq_queued(p)) {
rq = move_queued_task(rq, &rf, p, dest_cpu);
}
}
In general, the code above finds dest_cpu according to the cpumask, then stops the current CPU and reschedules so the process migrates to the destination CPU.
Comments are welcome!
There is a huge amount of code in the Linux repository… The above analysis is just an overview and may be inaccurate in places. Please leave a comment if you find an error in it.
Further Reading
If you would like to learn more, see:
- https://wiki.archlinux.org/index.php/cgroups
- https://man7.org/linux/man-pages/man7/cgroups.7.html
- https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt
- https://www.kernel.org/doc/Documentation/cgroup-v2.txt