3. Structural Constraints
3-1. Top-down
3-2. No internal tasks
-4. Other Changes
- 4-1. [Un]populated Notification
- 4-2. Other Core Changes
- 4-3. Per-Controller Changes
- 4-3-1. blkio
- 4-3-2. cpuset
- 4-3-3. memory
-5. Planned Changes
- 5-1. CAP for resource control
+4. Delegation
+ 4-1. Model of delegation
+ 4-2. Common ancestor rule
+5. Other Changes
+ 5-1. [Un]populated Notification
+ 5-2. Other Core Changes
+ 5-3. Controller File Conventions
+ 5-3-1. Format
+ 5-3-2. Control Knobs
+ 5-4. Per-Controller Changes
+ 5-4-1. io
+ 5-4-2. cpuset
+ 5-4-3. memory
+6. Planned Changes
+ 6-1. CAP for resource control
1. Background
allows mixing unified hierarchy with the traditional multiple
hierarchies in a fully backward compatible way.
-For development purposes, the following boot parameter makes all
-controllers to appear on the unified hierarchy whether supported or
-not.
-
- cgroup__DEVEL__legacy_files_on_dfl
-
A controller can be moved across hierarchies only after the controller
is no longer referenced in its current hierarchy. Because per-cgroup
controller states are destroyed asynchronously and controllers may
universal, and there are various other knobs which simply aren't
available for tasks.
-The blkio controller implicitly creates a hidden leaf node for each
+The io controller implicitly creates a hidden leaf node for each
cgroup to host the tasks. The hidden leaf has its own copies of all
the knobs with "leaf_" prefixed. While this allows equivalent control
over internal tasks, it's with serious drawbacks. It always adds an
before enabling controllers in its "cgroup.subtree_control" file.
-4. Other Changes
+4. Delegation
+
+4-1. Model of delegation
+
+A cgroup can be delegated to a less privileged user by granting the
+user write access to the directory and its "cgroup.procs" file. Note
+that the resource control knobs in a given directory concern the
+resources of the parent and thus must not be delegated along with the
+directory.
+
+Once delegated, the user can build a sub-hierarchy under the directory,
+organize processes as it sees fit and further distribute the resources
+it got from the parent. The limits and other settings of all resource
+controllers are hierarchical and, regardless of what happens in the
+delegated sub-hierarchy, nothing can escape the resource restrictions
+imposed by the parent.
+
+Currently, cgroup doesn't impose any restrictions on the number of
+cgroups in or nesting depth of a delegated sub-hierarchy; however,
+this may in the future be limited explicitly.
+
+
+4-2. Common ancestor rule
+
+On the unified hierarchy, to write to a "cgroup.procs" file, in
+addition to the usual write permission to the file and uid match, the
+writer must also have write access to the "cgroup.procs" file of the
+common ancestor of the source and destination cgroups. This prevents
+delegatees from smuggling processes across disjoint sub-hierarchies.
+
+Let's say cgroups C0 and C1 have been delegated to user U0 who created
+C00, C01 under C0 and C10 under C1 as follows.
+
+ ~~~~~~~~~~~~~ - C0 - C00
+ ~ cgroup ~ \ C01
+ ~ hierarchy ~
+ ~~~~~~~~~~~~~ - C1 - C10
+
+C0 and C1 are separate entities in terms of resource distribution
+regardless of their relative positions in the hierarchy. The
+resources the processes under C0 are entitled to are controlled by
+C0's ancestors and may be completely different from C1. It's clear
+that the intention of delegating C0 to U0 is allowing U0 to organize
+the processes under C0 and further control the distribution of C0's
+resources.
+
+On traditional hierarchies, if a task has write access to "tasks" or
+"cgroup.procs" file of a cgroup and its uid agrees with the target, it
+can move the target to the cgroup. In the above example, U0 will not
+only be able to move processes in each sub-hierarchy but also across
+the two sub-hierarchies, effectively allowing it to violate the
+organizational and resource restrictions implied by the hierarchical
+structure above C0 and C1.
+
+On the unified hierarchy, let's say U0 wants to write the pid of a
+process which has a matching uid and is currently in C10 into
+"C00/cgroup.procs". U0 obviously has write access to the file and
+migration permission on the process; however, the common ancestor of
+the source cgroup C10 and the destination cgroup C00 is above the
+points of delegation. U0 does not have write access to its
+"cgroup.procs", so the write is denied with -EACCES.
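
The rule above can be modelled in a few lines. This is only a sketch of
the permission logic, not the kernel implementation: cgroups are
represented as plain path strings, the uid match is ignored, and the
names common_ancestor(), may_migrate() and writable_procs are
hypothetical helpers introduced for illustration.

```python
import posixpath


def common_ancestor(src, dst):
    """Return the deepest cgroup path that contains both src and dst."""
    return posixpath.commonpath([src, dst])


def may_migrate(writable_procs, src, dst):
    """Model the unified hierarchy check for writing a pid into dst.

    writable_procs is the set of cgroup paths whose "cgroup.procs" file
    the writer may write to. The uid match is assumed to hold already.
    """
    if dst not in writable_procs:
        return False
    # The writer also needs write access to "cgroup.procs" of the
    # common ancestor of the source and destination cgroups.
    return common_ancestor(src, dst) in writable_procs


# U0 was delegated C0 and C1 and created C00, C01 and C10 itself.
u0 = {"/C0", "/C0/C00", "/C0/C01", "/C1", "/C1/C10"}
print(may_migrate(u0, "/C0/C00", "/C0/C01"))  # stays inside C0
print(may_migrate(u0, "/C1/C10", "/C0/C00"))  # crosses C1 -> C0
```

In this model the first move is allowed because the common ancestor is
C0, which U0 can write to, while the second is refused because the
common ancestor is the root, which lies above the points of delegation.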
-4-1. [Un]populated Notification
+
+5. Other Changes
+
+5-1. [Un]populated Notification
cgroup users often need a way to determine when a cgroup's
subhierarchy becomes empty so that it can be cleaned up. cgroup
unnecessarily complicated and probably done this way because event
delivery itself was expensive.
-Unified hierarchy implements an interface file "cgroup.populated"
-which can be used to monitor whether the cgroup's subhierarchy has
-tasks in it or not. Its value is 0 if there is no task in the cgroup
-and its descendants; otherwise, 1. poll and [id]notify events are
-triggered when the value changes.
+Unified hierarchy implements a "populated" field in the "cgroup.events"
+interface file, which can be used to monitor whether the cgroup's
+subhierarchy has tasks in it or not. Its value is 0 if there is no
+task in the cgroup and its descendants; otherwise, 1. poll and
+[id]notify events are triggered when the value changes.
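
As an illustration, userspace could read the field roughly as follows.
This is a minimal sketch which assumes that "cgroup.events" is a flat
keyed file containing a "populated 0" or "populated 1" line; the
function names and the cgroup_dir parameter are hypothetical.

```python
import os


def parse_populated(events_text):
    """Return True if the "populated" field in the given text is 1."""
    for line in events_text.splitlines():
        key, _, value = line.strip().partition(" ")
        if key == "populated":
            return value.strip() == "1"
    raise KeyError("populated")


def is_populated(cgroup_dir):
    """Read "cgroup.events" in cgroup_dir and report the populated state."""
    with open(os.path.join(cgroup_dir, "cgroup.events")) as f:
        return parse_populated(f.read())
```

A monitor would typically call is_populated() again each time poll or
inotify reports a change on the file.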
This is significantly lighter and simpler and trivially allows
delegating management of subhierarchy - subhierarchy monitoring can
"notify_on_release" do not exist.
-4-2. Other Core Changes
+5-2. Other Core Changes
- None of the mount options is allowed.
- The "cgroup.clone_children" file is removed.
+- /proc/PID/cgroup keeps reporting the cgroup that a zombie belonged
+ to before exiting. If the cgroup is removed before the zombie is
+ reaped, " (deleted)" is appended to the path.
+
+
+5-3. Controller File Conventions
+
+5-3-1. Format
+
+In general, all controller files should be in one of the following
+formats whenever possible.
+
+- Values only files
+
+ VAL0 VAL1...\n
+
+- Flat keyed files
+
+ KEY0 VAL0\n
+ KEY1 VAL1\n
+ ...
+
+- Nested keyed files
+
+ KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
+ KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
+ ...
+
+For a writeable file, the format for writing should generally match
+reading; however, controllers may allow omitting later fields or
+implement restricted shortcuts for most common use cases.
+
+For both flat and nested keyed files, only the values for a single key
+can be written at a time. For nested keyed files, the sub key pairs
+may be specified in any order and not all pairs have to be specified.
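
The three formats are simple enough to parse generically. The
following is a sketch of userspace parsers, assuming whitespace
separated fields as shown above; the function names are hypothetical
and all values are kept as strings.

```python
def parse_values(text):
    """Values only file: "VAL0 VAL1...\\n" -> list of values."""
    return text.split()


def parse_flat_keyed(text):
    """Flat keyed file: one "KEY VAL" pair per line -> dict."""
    result = {}
    for line in text.splitlines():
        key, _, value = line.strip().partition(" ")
        if key:
            result[key] = value.strip()
    return result


def parse_nested_keyed(text):
    """Nested keyed file: "KEY SUB_KEY=VAL ..." per line -> dict of dicts."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:
            # Sub key pairs may appear in any order, so use a dict.
            result[fields[0]] = dict(f.split("=", 1) for f in fields[1:])
    return result
```

Because sub key pairs are stored in a dict, a reader written this way
does not depend on their order and tolerates fields being added later.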
+
+
+5-3-2. Control Knobs
+
+- Settings for a single feature should generally be implemented in a
+ single file.
+
+- In general, the root cgroup should be exempt from resource control
+ and thus shouldn't have resource control knobs.
+
+- If a controller implements ratio based resource distribution, the
+ control knob should be named "weight" and have the range [1, 10000]
+ and 100 should be the default value. The values are chosen to allow
+ enough and symmetric bias in both directions while keeping it
+ intuitive (the default is 100%).
+
+- If a controller implements an absolute resource guarantee and/or
+ limit, the control knobs should be named "min" and "max"
+ respectively. If a controller implements best effort resource
+ guarantee and/or limit, the control knobs should be named "low" and
+ "high" respectively.
+
+ In the above four control files, the special token "max" should be
+ used to represent upward infinity for both reading and writing.
+
+- If a setting has a configurable default value and specific overrides,
+ the default setting should be keyed with "default" and appear as
+ the first entry in the file. A specific entry can use "default" as
+ its value to indicate inheritance of the default value.
+
+- For events which are not very high frequency, an interface file
+ "events" should be created which lists event key value pairs.
+ Whenever a notifiable event happens, a file modified event should be
+ generated on the file.
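
A userspace helper following these conventions might look like the
sketch below. It assumes that limits are integers except for the
special token "max", which is mapped to infinity here; the constant and
function names are hypothetical.

```python
import math

# Range and default for "weight" knobs as described above.
WEIGHT_MIN, WEIGHT_DEFAULT, WEIGHT_MAX = 1, 100, 10000


def parse_limit(token):
    """Convert a "min"/"max"/"low"/"high" value, mapping "max" to infinity."""
    return math.inf if token == "max" else int(token)


def format_limit(value):
    """Inverse of parse_limit(): infinity is written as "max"."""
    return "max" if value == math.inf else str(int(value))


def validate_weight(weight):
    """Return weight if it lies in [WEIGHT_MIN, WEIGHT_MAX], else raise."""
    if not WEIGHT_MIN <= weight <= WEIGHT_MAX:
        raise ValueError("weight out of range: %r" % (weight,))
    return weight
```

The range is symmetric around the default in the sense that the
default is 100 times the minimum and the maximum is 100 times the
default.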
+
+
+5-4. Per-Controller Changes
+
+5-4-1. io
+
+- blkio is renamed to io. The interface is overhauled anyway. The
+ new name is more in line with the other two major controllers, cpu
+ and memory, and better suited given that it may be used for cgroup
+ writeback without involving block layer.
+
+- Everything including stat is always hierarchical making separate
+ recursive stat files pointless and, as no internal node can have
+ tasks, leaf weights are meaningless. The operation model is
+ simplified and the interface is overhauled accordingly.
+
+ io.stat
+
+ The stat file. The reported stats are from the point where
+ bio's are issued to request_queue. The stats are counted
+ independent of which policies are enabled. Each line in the
+ file follows the following format. More fields may later be
+ added at the end.
+
+ $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wios=$WIOS
+
+ io.weight
+
+ The weight setting, currently only available and effective if
+ cfq-iosched is in use for the target device. The weight is
+ between 1 and 10000 and defaults to 100. The first line
+ always contains the default weight, which is used when a
+ per-device setting is missing, in the following format.
+
+ default $WEIGHT
+
+ Subsequent lines list per-device weights of the following
+ format.
+
+ $MAJ:$MIN $WEIGHT
+
+ Writing "$WEIGHT" or "default $WEIGHT" changes the default
+ setting. Writing "$MAJ:$MIN $WEIGHT" sets per-device weight
+ while "$MAJ:$MIN default" clears it.
+
+ This file is available only on non-root cgroups.
+
+ io.max
+
+ The maximum bandwidth and/or iops setting, only available if
+ blk-throttle is enabled. The file is of the following format.
-4-3. Per-Controller Changes
+ $MAJ:$MIN rbps=$RBPS wbps=$WBPS riops=$RIOPS wiops=$WIOPS
-4-3-1. blkio
+ ${R|W}BPS are read/write bytes per second and ${R|W}IOPS are
+ read/write IOs per second. "max" indicates no limit. Writing
+ to the file follows the same format but the individual
+ settings may be omitted or specified in any order.
-- blk-throttle becomes properly hierarchical.
+ This file is available only on non-root cgroups.
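
To make the io.weight and io.max formats concrete, here is a sketch of
two userspace helpers. It assumes the file layouts described above;
the function names are hypothetical, and the device numbers in the
example are arbitrary.

```python
IO_MAX_KEYS = ("rbps", "wbps", "riops", "wiops")


def effective_weight(io_weight_text, device):
    """Return the weight that applies to device ("MAJ:MIN").

    Falls back to the "default" entry when no per-device line exists.
    """
    weights = {}
    for line in io_weight_text.splitlines():
        key, _, value = line.strip().partition(" ")
        if key:
            weights[key] = int(value)
    return weights.get(device, weights["default"])


def format_io_max(device, **limits):
    """Build a line to write to io.max; omitted settings are left alone.

    Each limit is an integer or the string "max" for no limit.
    """
    unknown = set(limits) - set(IO_MAX_KEYS)
    if unknown:
        raise ValueError("unknown io.max keys: %s" % sorted(unknown))
    fields = ["%s=%s" % (key, limits[key]) for key in IO_MAX_KEYS if key in limits]
    return " ".join([device] + fields)


print(effective_weight("default 100\n8:0 200\n", "8:16"))
print(format_io_max("8:0", wiops="max", rbps=1048576))
```

The formatter emits the settings in a fixed order for readability
only; as noted above, the file accepts them in any order.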
-4-3-2. cpuset
+5-4-2. cpuset
- Tasks are kept in empty cpusets after hotplug and take on the masks
of the nearest non-empty ancestor, instead of being moved to it.
masks of the nearest non-empty ancestor.
-4-3-3. memory
+5-4-3. memory
- use_hierarchy is on by default and the cgroup file for the flag is
not created.
memory.low, memory.high, and memory.max will use the string "max" to
indicate and set the highest possible value.
-5. Planned Changes
+6. Planned Changes
-5-1. CAP for resource control
+6-1. CAP for resource control
Unified hierarchy will require one of the capabilities(7), which is
yet to be decided, for all resource control related knobs. Process