Understanding Linux Cgroups

Cgroups are one of the Linux kernel features that make containers a reality. In this post, we will learn what Linux cgroups are and how they work, through simple explanations, illustrations and examples.

What is a cgroup

A Linux cgroup, or control group, is a mechanism that can be used to hierarchically organize processes and control the amount of system resources (CPU, RAM, read/write speed on devices, etc.) used inside that process hierarchy.

linux-cgroups-overview.webp

Resource control is achieved through specific controllers, which must be enabled. Here is an overview of some of the controllers and the system resources they control:

linux-cgroups-controllers.webp

Cgroup implementation

Cgroup implementation inside the Linux Kernel can be divided into two parts:

  • core code: hierarchical grouping of processes and the common infrastructure not implemented in controllers

  • controllers: separate subsystems for each resource type (cpu, ram, etc), implementing resource tracking and limits along the hierarchy

Cgroup versions

There are currently two versions of cgroup: version 1 and version 2. Version 1 was initially released in Linux 2.6.24. Over time, problems related to inconsistencies between controllers and the complexity of managing the cgroup hierarchy led to the creation of version 2, officially released in Linux 4.5 to fix those issues.

Cgroup version 2 is intended to replace version 1, but for compatibility reasons version 1 continues to exist and is unlikely to be removed. Controllers available in version 1 are being progressively ported to version 2. Controllers missing from version 2 can still be used through version 1, even while other controllers are in use through version 2.

Cgroup pseudo-filesystem (cgroupfs)

Cgroup functionality is exposed to users through the cgroupfs pseudo-filesystem, which is mounted by default at '/sys/fs/cgroup', although it can be mounted elsewhere. Recent Linux distributions use cgroup version 2 by default.

$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)

If you wonder how to mount the cgroupfs somewhere, here is the command:

mount -t cgroup2 none $MOUNT_POINT

Through the cgroupfs, we can enable or disable cgroup controllers, create or remove cgroups, and add processes to cgroups to control their resource usage. We will see that in detail in the Cgroup manipulation examples section.
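
Before manipulating the cgroupfs, it can help to confirm which cgroup version your system exposes. Here is a minimal sketch, assuming the default '/sys/fs/cgroup' mount point: on cgroup v2, a top-level 'cgroup.controllers' file exists directly under the mount point.

```shell
# Detect the cgroup version mounted at /sys/fs/cgroup.
# On cgroup v2, a top-level cgroup.controllers file is present;
# on cgroup v1, each controller has its own subdirectory instead.
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
  echo "cgroup v2"
else
  echo "cgroup v1 (or hybrid)"
fi
```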

Cgroup hierarchy

Cgroups are organized into a parent-child hierarchy. At the top is the root cgroup, the ancestor of every other cgroup on the system. It is natively provided by the cgroupfs, which is mounted by default at '/sys/fs/cgroup'.

$ ls /sys/fs/cgroup/
cgroup.controllers      cgroup.stat             cpu.stat               dev-mqueue.mount  io.pressure       memory.pressure                sys-fs-fuse-connections.mount  system.slice
cgroup.max.depth        cgroup.subtree_control  cpuset.cpus.effective  init.scope        io.prio.class     memory.stat                    sys-kernel-config.mount        user.slice
cgroup.max.descendants  cgroup.threads          cpuset.mems.effective  io.cost.model     io.stat           misc.capacity                  sys-kernel-debug.mount
cgroup.procs            cpu.pressure            dev-hugepages.mount    io.cost.qos       memory.numa_stat  proc-sys-fs-binfmt_misc.mount  sys-kernel-tracing.mount

Creating a directory inside the root cgroup creates a child cgroup, which in turn can have children, and so on. Resource distribution across a cgroup hierarchy is top-down (from parents to children).

Only controllers enabled in a parent cgroup can be configured in a child cgroup (enabled, disabled, used to limit resources, etc.). This is how resources are distributed to children. For more about parent-to-child resource distribution schemes, have a look at cgroup-resources-distribution-models.

Cgroup interface files

Files inside the cgroupfs are cgroup interface files. They are read-write and read-only files used to configure the cgroup, or to get configuration information and statistics about the cgroup and its controllers.

For instance, for a given cgroup, the read-only 'cgroup.stat' file gives the number of its descendant cgroups (children, grandchildren and so on). And through the 'cgroup.max.descendants' interface file, we can set the cgroup's maximum number of descendant cgroups.

The 'cgroup.*' files are cgroup core interface files and the others are controller-specific interface files. The 'cpu.*' files, for instance, are interface files for the CPU controller. Directories represent child cgroups.
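
To see interface files in action, we can read the root cgroup's 'cgroup.stat' without root privileges, since it is read-only. Writing an interface file such as 'cgroup.max.descendants' requires root; the value 100 and the 'mycgroup' path below are just illustrations.

```shell
# Read the root cgroup's statistics: nr_descendants and
# nr_dying_descendants report the live and dying descendant counts.
cat /sys/fs/cgroup/cgroup.stat

# Writable interface files need root, e.g. capping the number of
# descendant cgroups (illustrative value; mycgroup is assumed to exist):
#   echo 100 | sudo tee /sys/fs/cgroup/mycgroup/cgroup.max.descendants
```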

Cgroup utilities

In addition to manipulating cgroups directly inside the cgroupfs directory, there are command line tools we can use to configure cgroups and the behavior of their associated controllers. Those tools are provided by the 'cgroup-tools' package. Here is a list of a few of them; follow the links for the full command manuals and usage examples:

  • cgcreate - create one or more cgroups by defining one or more 'Controllers:Path' pairs for each of them, through the '-g' flag. Path is the cgroup directory path inside the cgroupfs. Controllers is the comma-separated list of controllers that should be available in the mounted hierarchies (the available cgroupfs on the system) where the cgroup will be created. A wildcard can be used to indicate all available controllers.

  • cgexec - execute a program inside a specific cgroup within chosen controllers

  • cgset - set parameters for specific cgroups

  • cgget - show parameters of specific cgroups

  • cgdelete - remove cgroups

Cgroup manipulation examples

How to create a cgroup

Simply create a directory inside the cgroupfs:

$ cd /sys/fs/cgroup/
$ sudo mkdir mycgroup

or use the 'cgcreate' utility as follows:

# Syntax
# cgcreate -g Controllers:Path

# Create a cgroup called mycgroup in hierarchies
# where the cpu and memory controllers are available
$ cgcreate -g cpu,memory:mycgroup

The cgroup interface files are automatically created for the new cgroup. The interface files of the active controllers of that cgroup are also visible:

  • cpu.* files - can be used to track and control the CPU resources consumption of processes belonging to the cgroup and its descendants
  • io.* files - can be used to track and control the read and write IO speed on specific block devices, for processes belonging to the cgroup and its descendants
  • memory.* files - can be used to track and control the Memory resources consumption of processes belonging to the cgroup and its descendants
  • pids.* files - can be used to track and control the number of tasks that processes belonging to the cgroup and its descendants can create
  • cpuset.* files - can be used to constrain processes belonging to the cgroup and its descendants to a specific set of CPUs and memory nodes
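
As a quick taste of one of these controllers, here is a sketch using the pids controller to cap the number of tasks in the new cgroup. It requires root; 'mycgroup' and the value 50 are just the running example, and the pids controller is assumed to be enabled for it.

```shell
# Cap the number of tasks in mycgroup at 50; further fork()/clone()
# calls from processes in the cgroup will then fail with EAGAIN.
echo 50 | sudo tee /sys/fs/cgroup/mycgroup/pids.max

# pids.current reports how many tasks the cgroup currently contains.
cat /sys/fs/cgroup/mycgroup/pids.current
```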

The list of controllers that can be used by a given cgroup is controlled by the parent cgroup's cgroup.subtree_control file. Here is the content of the newly created cgroup directory:

$ ls /sys/fs/cgroup/mycgroup/
cgroup.controllers  cgroup.max.descendants  cgroup.type    cpu.stat         cpuset.cpus            io.max         memory.current       memory.max        memory.stat          pids.current
cgroup.events       cgroup.procs            cpu.idle       cpu.uclamp.max   cpuset.cpus.effective  io.pressure    memory.events        memory.min        memory.swap.current  pids.events
cgroup.freeze       cgroup.stat             cpu.max        cpu.uclamp.min   cpuset.cpus.partition  io.prio.class  memory.events.local  memory.numa_stat  memory.swap.events   pids.max
cgroup.kill         cgroup.subtree_control  cpu.max.burst  cpu.weight       cpuset.mems            io.stat        memory.high          memory.oom.group  memory.swap.high
cgroup.max.depth    cgroup.threads          cpu.pressure   cpu.weight.nice  cpuset.mems.effective  io.weight      memory.low           memory.pressure   memory.swap.max

If a non-root cgroup has no specific configuration, it inherits the resource control settings of its nearest configured ancestor.

How to remove a cgroup

Ensure the cgroup has no child cgroups and no live (non-zombie) processes. Then, simply remove the cgroup directory from the cgroupfs:

$ cd /sys/fs/cgroup
$ sudo rmdir mycgroup

or use the 'cgdelete' utility as follows:

# Syntax
# cgdelete -g Controllers:Path

# Delete the cgroup called mycgroup in hierarchies
# where the cpu and memory controllers are available
$ cgdelete -g cpu,memory:mycgroup

How to list available cgroup controllers

Simply have a look at the cgroup's 'cgroup.controllers' file:

$ cat /sys/fs/cgroup/mycgroup/cgroup.controllers 
cpuset cpu io memory pids

That file contains the names of the controllers that are available to the cgroup, meaning the cgroup can control the system resources associated with those controllers for the processes it manages.

That list of controllers available to the cgroup is controlled by the parent cgroup's 'cgroup.subtree_control' file:

$ cat /sys/fs/cgroup/cgroup.subtree_control 
cpuset cpu io memory pids

Let's add the 'misc' controller to that file and see the impact:

# See the available controllers for the parent cgroup
$ cat /sys/fs/cgroup/cgroup.controllers 
cpuset cpu io memory hugetlb pids rdma misc

# Add the misc controller into the parent 
# cgroup's cgroup.subtree_control file
$ echo "+misc" > /sys/fs/cgroup/cgroup.subtree_control

# That misc controller is now available
# to the child cgroup called mycgroup
$ cat /sys/fs/cgroup/mycgroup/cgroup.controllers 
cpuset cpu io memory pids misc

An empty 'cgroup.subtree_control' file in a parent cgroup directory simply means that no controllers are available to its children. In that case, the processes of the parent cgroup, those of the child and those of any siblings share system resources according to the resource control settings of the parent cgroup.

How to add processes into a cgroup

To add an existing process into a cgroup, simply add its PID to the cgroup's 'cgroup.procs' file:

$ echo 'PID' > /sys/fs/cgroup/mycgroup/cgroup.procs

or use the 'cgexec' utility to run a program inside a cgroup:

# Syntax
# cgexec -g Controllers:Path Command

# Run bash inside the cgroup called mycgroup (from /sys/fs/cgroup)
# where the cpu and memory controllers are available
$ cgexec -g cpu,memory:mycgroup bash
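
A common pitfall when writing the PID with shell redirection: 'sudo echo PID > file' fails, because the redirection is performed by your unprivileged shell, not by sudo. Piping through 'tee' avoids that ('mycgroup' is assumed to exist):

```shell
# Move the current shell ($$) into mycgroup, writing as root via tee.
echo $$ | sudo tee /sys/fs/cgroup/mycgroup/cgroup.procs
```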

How to list the cgroups of a specific process

Simply have a look at the '/proc/PID/cgroup' file for that process:

# List the cgroups of the process with PID 16461
$ cat /proc/16461/cgroup 
0::/mycgroup

How to view and edit cgroup controllers parameters

Directly view and edit the cgroup interface files (the files inside the cgroup directory) or use the 'cgget' and 'cgset' utilities as follows:

# Syntax
# cgget [-r Param1 -r Param2 ...] CgroupName1 [CgroupName2 ...]
# cgset [-r Param1=Value1 -r Param2=Value2 ...] CgroupName1 [CgroupName2 ...]

# Show all parameters values for the cgroup called mycgroup
$ cgget mycgroup
mycgroup:
cpuset.cpus.partition: member
cpuset.cpus.effective: 0-1
cpuset.mems:
cpuset.mems.effective: 0
cpuset.cpus:
cpu.weight: 100
cpu.stat: usage_usec 21205578248
        user_usec 10674775420
        system_usec 10530802828
        nr_periods 0
        nr_throttled 0
        throttled_usec 0
(...)

# Show the io.max parameter value for the cgroup called mycgroup
$ cgget -r io.max mycgroup
mycgroup:
io.max:

# Set the value of the io.max parameter for the cgroup called mycgroup
$ cgset -r io.max="8:0 rbps=max wbps=100000000 riops=max wiops=max" mycgroup

# Verify
$ cgget -r io.max mycgroup
mycgroup:
io.max: 8:0 rbps=max wbps=100000000 riops=max wiops=max
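
The '8:0' at the start of the io.max value is the block device's MAJOR:MINOR number pair (8:0 is typically /dev/sda). You can look up the pair for your devices with lsblk:

```shell
# List block devices with their MAJOR:MINOR numbers,
# as used in io.max lines.
lsblk -o NAME,MAJ:MIN
```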

Practical example: limiting processes memory usage

Now let's look at a practical example of limiting process memory usage with a cgroup. Here is the 'memory_allocator.sh' script, which allocates 10MB of memory every second until it reaches 200MB:

#!/bin/bash

# Directory for memory allocation
ALLOC_DIR="/dev/shm/mem_alloc_$$"
mkdir -p "$ALLOC_DIR"

# Clean up on exit
trap 'echo "Cleaning up..."; rm -rf "$ALLOC_DIR"' EXIT

# Allocate 10MB every second until 200MB is reached
for i in {1..20}; do
  dd if=/dev/zero of="$ALLOC_DIR/block_$i" bs=1M count=10 &>/dev/null
  allocated_mb=$((i * 10))
  echo "Allocated ${allocated_mb}MB"
  sleep 1
done

echo "Total 200MB allocated. Holding for 60 seconds before cleanup."
sleep 60

We are going to create a cgroup that limits memory usage to 100MB, then run the 'memory_allocator.sh' script inside that cgroup and see what happens.

Let's create a new cgroup called 'mycgroup' in hierarchies where the memory controller is available.

# Create the cgroup called mycgroup.
# We only have one cgroup hierarchy in this case
# which is mounted at /sys/fs/cgroup
$ cgcreate -g memory:mycgroup

Let's list the current settings for the memory controller:

# Listing memory settings of the newly
# created cgroup called mycgroup
$ cgget mycgroup | grep memory
memory.events: low 0
memory.events.local: low 0
memory.swap.current: 0
memory.swap.max: max
memory.swap.events: high 0
memory.pressure: some avg10=0.00 avg60=0.00 avg300=0.00 total=0
memory.current: 0
memory.stat: anon 0
memory.low: 0
memory.swap.high: max
memory.numa_stat: anon N0=0
memory.min: 0
memory.oom.group: 0
memory.max: max
memory.high: max

Now let's set the memory usage limit for that cgroup to 100MB. Note that the kernel rounds the value down to a multiple of the page size, which is why the verification shows 99999744 rather than 100000000:

$ cgset -r memory.max=100000000 mycgroup

# Verify
$ cgget -r memory.max mycgroup
mycgroup:
memory.max: 99999744

Very good. Now let's run the 'memory_allocator.sh' script through that cgroup and see what happens:

$ cgexec -g memory:mycgroup ./memory_allocator.sh 
Allocated 10MB
Allocated 20MB
Allocated 30MB
Allocated 40MB
Allocated 50MB
Allocated 60MB
Allocated 70MB
Allocated 80MB
Allocated 90MB
Killed

Ah! The script was killed while allocating its tenth 10MB block: the cgroup caps its memory usage at 100MB, and that allocation pushed it over the limit.
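
Besides the process being killed, the OOM event is also recorded in the cgroup itself: the 'oom' and 'oom_kill' counters in the cgroup's 'memory.events' interface file increment each time the limit is hit and a task is killed ('mycgroup' from the steps above is assumed to still exist):

```shell
# Inspect the cgroup's memory events; after the kill above,
# the oom and oom_kill counters should be non-zero.
cat /sys/fs/cgroup/mycgroup/memory.events
```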

Here is the Kernel log that indicates that the 'memory_allocator.sh' process has been killed by the Kernel Out-Of-Memory (OOM) Killer because the memory cgroup of the process was out of memory:

$ dmesg
(...)
[43362.458500] Tasks state (memory values in pages):
[43362.458505] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[43362.458514] [   2484]     0  2484     1816      701    57344        0             0 dd
[43362.458541] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=mycgroup,mems_allowed=0,oom_memcg=/mycgroup,task_memcg=/mycgroup,task=dd,pid=2484,uid=0
[43362.458611] Memory cgroup out of memory: Killed process 2484 (dd) total-vm:7264kB, anon-rss:1028kB, file-rss:1776kB, shmem-rss:0kB, UID:0 pgtables:56kB oom_score_adj:0

That's all. I hope you now understand Linux cgroups better.

Want to report a mistake or ask questions? Feel free to email me at gmkziz@hackerstack.org. I will be glad to answer.

If you like my articles, consider registering to my newsletter in order to receive the latest posts as soon as they are available.

Take care, keep learning and see you in the next post 🚀