Exploration of Linux cgroups
Introduction
cgroups is a Linux kernel feature that isolates and limits computer resources (e.g. CPU, memory, disk I/O, network).
It is the cornerstone of today's hottest containerization/orchestration technologies, including Docker and Kubernetes.
Background
The essence of any virtualization technique is isolation and management of some resource, and cgroups are no exception.
Just as the process abstraction implements management/isolation of machine-code execution, cgroups implement management/isolation of groups of processes.
Real-world example
I previously worked on an old service platform. The whole stack was developed before containerization became mainstream, so each service ran as a plain executable inside a VM.
At one point I found that low-memory alerts were being triggered constantly by the service. It turned out that a network issue had produced a flood of error logs, and the embedded fluentd log agent ate up all the RAM processing them.
This shows how fragile the system was: even the logging process could crash the whole service. With cgroup-based containers and appropriate resource limits, the problem would have been confined to the log container, preventing a logging issue from affecting the service's main logic.
Play around Linux command
TL;DR: let's play with the cgroup commands to get a more intuitive impression.
Find cgroup of a process
A cgroup is, in essence, an attribute of a process, so cgroup info can be found under the /proc/PID/ directory.
cat /proc/self/cgroup
Sample output:
2:blkio:/init.scope
1:name=systemd:/init.scope
0::/init.scope
List all cgroups
Use systemctl status
to get the cgroup hierarchy. The output looks like:
CGroup: /
├─user.slice
│ └─user-1000.slice
│ ├─user@1000.service
│ │ ├─gnome-shell-wayland.service
│ │ │ ├─ 1129 /usr/bin/gnome-shell
│ │ ├─gnome-terminal-server.service
│
├─init.scope
│ └─1 /sbin/init
└─system.slice
├─systemd-udevd.service
│ └─285 /usr/lib/systemd/systemd-udevd
├─systemd-journald.service
│ └─272 /usr/lib/systemd/systemd-journald
├─NetworkManager.service
│ └─656 /usr/bin/NetworkManager --no-daemon
We can see that there are three top-level categories: init.scope, user.slice and system.slice.
Find resource usage of cgroup
Use systemd-cgtop
to see the resource usage of each group.
Sample output:
Control Group Tasks %CPU Memory Input/s Output/s
/ 2031 76.8 17.7G - -
user.slice 1660 64.1 14.7G - -
system.slice 196 3.4 2.4G - -
How cgroup works internally
File-based design
Like most Linux components, cgroups follow the famous Unix rule: everything is a file.
Linux mounts a filesystem at /sys/fs/cgroup
to represent cgroups. The cgroup hierarchy mirrors the structure of this directory.
Controller
Each top-level folder in the cgroup filesystem corresponds to a controller, e.g. the cpu controller is just the folder /sys/fs/cgroup/cpu
.
If you would like to use a controller, just mount it into the cgroup filesystem:
mount -t cgroup -o cpu none /sys/fs/cgroup/cpu
Common controllers
cpu
This controller limits the CPU time processes in the group can use.
cpuacct
This controller provides CPU usage accounting/statistics for the group.
memory
This controller limits and accounts for the memory used by processes in the group.
Move a process to cgroup
All processes in a cgroup are listed in its cgroup.procs
file. Just write a PID to this file to add a process to the cgroup:
echo $$ > /sys/fs/cgroup/cpu/cg1/cgroup.procs
Kernel code analysis
We cannot fully understand cgroups without reading the source code directly. Let's dive into the Linux kernel source to see how cgroup is implemented.
source file
The cgroup source code is located at: linux/kernel/cgroup/
Overview
The kernel basically needs to do two things for cgroups:
- In linux/init/main.c, call cgroup_init() to read/initialize the root cgroups at system boot;
- For each process created, make sure it is assigned to the appropriate cgroups.
data structure
cgroup
Info in cgroup filesystem is loaded into this data structure.
struct cgroup {
...
int level;
/* Maximum allowed descent tree depth */
int max_depth;
int nr_descendants;
int nr_dying_descendants;
int max_descendants;
struct kernfs_node *kn; /* cgroup kernfs entry */
struct cgroup_file procs_file; /* handle for "cgroup.procs" */
struct cgroup_file events_file; /* handle for "cgroup.events" */
...
};
cgroup_subsys
This is one of the core data structures of the cgroup implementation. It represents a specific controller (subsystem); its per-cgroup instance is a cgroup_subsys_state, usually abbreviated css.
Following is a simplified cgroup_subsys struct definition (some detailed code removed):
struct cgroup_subsys {
struct cgroup_subsys_state *(*css_alloc)(struct cgroup_subsys_state *parent_css);
int (*css_online)(struct cgroup_subsys_state *css);
void (*css_offline)(struct cgroup_subsys_state *css);
void (*css_released)(struct cgroup_subsys_state *css);
void (*css_free)(struct cgroup_subsys_state *css);
void (*css_reset)(struct cgroup_subsys_state *css);
void (*fork)(struct task_struct *task);
void (*exit)(struct task_struct *task);
void (*release)(struct task_struct *task);
void (*bind)(struct cgroup_subsys_state *root_css);
};
The most important function here is fork
, which the kernel uses to apply each controller's settings to a newly created process.
Example: cpuset
cpuset is a typical cgroup controller. It controls the processor placement (a.k.a. processor affinity) of processes. A typical flow of forking a new process under a cpuset cgroup looks like:
_do_fork(linux/kernel/fork.c)
↓
cgroup_fork
↓
cpuset_fork
_do_fork
_do_fork is the backend of the fork
system call.
Its main logic is copying the content of the current process into a new process.
long _do_fork(unsigned long clone_flags,
unsigned long stack_start,
unsigned long stack_size,
int __user *parent_tidptr,
int __user *child_tidptr,
unsigned long tls)
{
...
p = copy_process(clone_flags, stack_start, stack_size,
child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
...
}
In copy_process
, basically we do some configuration first, then schedule the forked process.
static __latent_entropy struct task_struct *copy_process(
unsigned long clone_flags,
unsigned long stack_start,
unsigned long stack_size,
int __user *child_tidptr,
struct pid *pid,
int trace,
unsigned long tls,
int node)
{
...
cgroup_fork(p);
...
retval = sched_fork(clone_flags, p);
...
}
cgroup_fork
cgroup_fork
initializes the cgroup data structures; the main cgroup logic is inside cgroup_post_fork
:
void cgroup_post_fork(struct task_struct *child)
{
	struct cgroup_subsys *ss;
	int i;

	do_each_subsys_mask(ss, i, have_fork_callback) {
		ss->fork(child);
	} while_each_subsys_mask();
}
As the code indicates, the fork function of every subsystem (controller) that registers one will be called. cpuset is one such subsystem, with its own cgroup_subsys instance that implements these callbacks:
struct cgroup_subsys cpuset_cgrp_subsys = {
...
.css_alloc = cpuset_css_alloc,
.css_free = cpuset_css_free,
.fork = cpuset_fork,
...
};
cpuset_fork
The cpuset_fork implementation is as follows:
//source file: linux/kernel/cgroup/cpuset.c
static void cpuset_fork(struct task_struct *task)
{
if (task_css_is_root(task, cpuset_cgrp_id))
return;
set_cpus_allowed_ptr(task, &current->cpus_allowed);
task->mems_allowed = current->mems_allowed;
}
Finally, we reach our destination: set_cpus_allowed_ptr
. This is the core of what cpuset is supposed to do: change a process's CPU affinity.
//source file: linux/kernel/sched/core.c (simplified; locking and some declarations omitted)
static int __set_cpus_allowed_ptr(struct task_struct *p,
const struct cpumask *new_mask, bool check)
{
const struct cpumask *cpu_valid_mask = cpu_active_mask;
unsigned int dest_cpu;
dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
if (task_running(rq, p) || p->state == TASK_WAKING) {
struct migration_arg arg = { p, dest_cpu };
/* Need help from migration thread: drop lock and wait. */
task_rq_unlock(rq, p, &rf);
stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
tlb_migrate_finish(p->mm);
return 0;
} else if (task_on_rq_queued(p)) {
rq = move_queued_task(rq, &rf, p, dest_cpu);
}
}
In general, the code above finds dest_cpu according to the cpumask, then stops the current CPU and reschedules so the process migrates to the destination CPU.
Comments are welcome!
There is a huge amount of code in the Linux repository… The above analysis is just an overview and may be inaccurate in places. Please leave a comment if you find an error in it.
Further Reading
If you would like to learn more, see:
- https://wiki.archlinux.org/index.php/cgroups
- https://man7.org/linux/man-pages/man7/cgroups.7.html
- https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt
- https://www.kernel.org/doc/Documentation/cgroup-v2.txt