|  |  | 
|  | Cgroup unified hierarchy | 
|  |  | 
|  | April, 2014		Tejun Heo <tj@kernel.org> | 
|  |  | 
|  | This document describes the changes made by unified hierarchy and | 
|  | their rationales.  It will eventually be merged into the main cgroup | 
|  | documentation. | 
|  |  | 
|  | CONTENTS | 
|  |  | 
|  | 1. Background | 
|  | 2. Basic Operation | 
|  | 2-1. Mounting | 
|  | 2-2. cgroup.subtree_control | 
|  | 2-3. cgroup.controllers | 
|  | 3. Structural Constraints | 
|  | 3-1. Top-down | 
|  | 3-2. No internal tasks | 
|  | 4. Delegation | 
|  | 4-1. Model of delegation | 
|  | 4-2. Common ancestor rule | 
|  | 5. Other Changes | 
|  | 5-1. [Un]populated Notification | 
|  | 5-2. Other Core Changes | 
|  | 5-3. Per-Controller Changes | 
|  | 5-3-1. blkio | 
|  | 5-3-2. cpuset | 
|  | 5-3-3. memory | 
|  | 6. Planned Changes | 
|  | 6-1. CAP for resource control | 
|  |  | 
|  |  | 
|  | 1. Background | 
|  |  | 
|  | cgroup allows an arbitrary number of hierarchies and each hierarchy | 
|  | can host any number of controllers.  While this seems to provide a | 
|  | high level of flexibility, it isn't very useful in practice. | 
|  |  | 
|  | For example, as there is only one instance of each controller, utility | 
|  | type controllers such as freezer which can be useful in all | 
|  | hierarchies can only be used in one.  The issue is exacerbated by the | 
|  | fact that controllers can't be moved around once hierarchies are | 
|  | populated.  Another issue is that all controllers bound to a hierarchy | 
|  | are forced to have exactly the same view of the hierarchy.  It isn't | 
|  | possible to vary the granularity depending on the specific controller. | 
|  |  | 
|  | In practice, these issues heavily limit which controllers can be put | 
|  | on the same hierarchy and most configurations resort to putting each | 
|  | controller on its own hierarchy.  Only closely related ones, such as | 
|  | the cpu and cpuacct controllers, make sense to put on the same | 
|  | hierarchy.  This often means that userland ends up managing multiple | 
|  | similar hierarchies repeating the same steps on each hierarchy | 
|  | whenever a hierarchy management operation is necessary. | 
|  |  | 
|  | Unfortunately, support for multiple hierarchies comes at a steep cost. | 
|  | Internal implementation in cgroup core proper is dazzlingly | 
|  | complicated but more importantly the support for multiple hierarchies | 
|  | restricts how cgroup is used in general and what controllers can do. | 
|  |  | 
|  | There's no limit on how many hierarchies there may be, which means | 
|  | that a task's cgroup membership can't be described in finite length. | 
|  | The key may contain any varying number of entries and is unlimited in | 
|  | length, which makes it highly awkward to handle and leads to the | 
|  | addition of controllers which exist only to identify membership, which | 
|  | in turn exacerbates the original problem. | 
|  |  | 
|  | Also, as a controller can't have any expectation regarding what shape | 
|  | of hierarchies other controllers would be on, each controller has to | 
|  | assume that all other controllers are operating on completely | 
|  | orthogonal hierarchies.  This makes it impossible, or at least very | 
|  | cumbersome, for controllers to cooperate with each other. | 
|  |  | 
|  | In most use cases, putting controllers on hierarchies which are | 
|  | completely orthogonal to each other isn't necessary.  What usually is | 
|  | called for is the ability to have differing levels of granularity | 
|  | depending on the specific controller.  In other words, hierarchy may | 
|  | be collapsed from leaf towards root when viewed from specific | 
|  | controllers.  For example, a given configuration might not care about | 
|  | how memory is distributed beyond a certain level while still wanting | 
|  | to control how CPU cycles are distributed. | 
|  |  | 
|  | Unified hierarchy is the next version of cgroup interface.  It aims to | 
|  | address the aforementioned issues by having more structure while | 
|  | retaining enough flexibility for most use cases.  Various other | 
|  | general and controller-specific interface issues are also addressed in | 
|  | the process. | 
|  |  | 
|  |  | 
|  | 2. Basic Operation | 
|  |  | 
|  | 2-1. Mounting | 
|  |  | 
|  | Currently, unified hierarchy can be mounted with the following mount | 
|  | command.  Note that this is still under development and scheduled to | 
|  | change soon. | 
|  |  | 
|  |   mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT | 
|  |  | 
|  | All controllers which support the unified hierarchy and are not bound | 
|  | to other hierarchies are automatically bound to unified hierarchy and | 
|  | show up at the root of it.  Controllers which are enabled only in the | 
|  | root of unified hierarchy can be bound to other hierarchies.  This | 
|  | allows mixing unified hierarchy with the traditional multiple | 
|  | hierarchies in a fully backward compatible way. | 
|  |  | 
|  | For development purposes, the following boot parameter makes all | 
|  | controllers appear on the unified hierarchy whether supported or | 
|  | not. | 
|  |  | 
|  |   cgroup__DEVEL__legacy_files_on_dfl | 
|  |  | 
|  | A controller can be moved across hierarchies only after the controller | 
|  | is no longer referenced in its current hierarchy.  Because per-cgroup | 
|  | controller states are destroyed asynchronously and controllers may | 
|  | have lingering references, a controller may not show up immediately on | 
|  | the unified hierarchy after the final umount of the previous | 
|  | hierarchy.  Similarly, a controller should be fully disabled to be | 
|  | moved out of the unified hierarchy and it may take some time for the | 
|  | disabled controller to become available for other hierarchies; | 
|  | furthermore, due to dependencies among controllers, other controllers | 
|  | may need to be disabled too. | 
|  |  | 
|  | While useful for development and manual configurations, dynamically | 
|  | moving controllers between the unified and other hierarchies is | 
|  | strongly discouraged for production use.  It is recommended to decide | 
|  | the hierarchies and controller associations before starting to use the | 
|  | controllers. | 
|  |  | 
|  |  | 
|  | 2-2. cgroup.subtree_control | 
|  |  | 
|  | All cgroups on unified hierarchy have a "cgroup.subtree_control" file | 
|  | which governs which controllers are enabled on the children of the | 
|  | cgroup.  Let's assume a hierarchy like the following. | 
|  |  | 
|  |   root - A - B - C | 
|  |                \ D | 
|  |  | 
|  | root's "cgroup.subtree_control" file determines which controllers are | 
|  | enabled on A.  A's on B.  B's on C and D.  This coincides with the | 
|  | fact that controllers on the immediate sub-level are used to | 
|  | distribute the resources of the parent.  In fact, it's natural to | 
|  | assume that resource control knobs of a child belong to its parent. | 
|  | Enabling a controller in a "cgroup.subtree_control" file declares that | 
|  | distribution of the respective resources of the cgroup will be | 
|  | controlled.  Note that this means that controller enable states are | 
|  | shared among siblings. | 
|  |  | 
|  | When read, the file contains a space-separated list of currently | 
|  | enabled controllers.  A write to the file should contain a | 
|  | space-separated list of controllers with '+' or '-' prefixed (without | 
|  | the quotes).  Controllers prefixed with '+' are enabled and '-' | 
|  | disabled.  If a controller is listed multiple times, the last entry | 
|  | wins.  The specific operations are executed atomically - either all | 
|  | succeed or all fail. | 
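
The write semantics described above can be illustrated with a small
userspace model.  This is only a sketch in Python, not the kernel
implementation; the function name apply_subtree_control, its arguments
and the use of ValueError for rejected writes are assumptions made for
illustration.

```python
def apply_subtree_control(enabled, available, request):
    """Return the new set of enabled controllers after one write.

    enabled   -- controllers currently listed in cgroup.subtree_control
    available -- controllers listed in the cgroup's cgroup.controllers
    request   -- the string written, e.g. "+memory -blkio"
    """
    wanted = {}  # controller name -> True (enable) / False (disable)
    for token in request.split():
        if len(token) < 2 or token[0] not in "+-":
            raise ValueError("bad token: %r" % token)
        # A later entry for the same controller overrides an earlier one.
        wanted[token[1:]] = token[0] == "+"

    # Validate everything first so that a failing write changes nothing.
    # (Only enables are checked here; that choice is an assumption.)
    for name, enable in wanted.items():
        if enable and name not in available:
            raise ValueError("controller not available: %s" % name)

    result = set(enabled)  # work on a copy, the input is left untouched
    for name, enable in wanted.items():
        if enable:
            result.add(name)
        else:
            result.discard(name)
    return result
```

For example, writing "+memory -memory" leaves memory disabled because
the last entry wins, and a write that names an unavailable controller
leaves the previously enabled set as it was.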
|  |  | 
|  |  | 
|  | 2-3. cgroup.controllers | 
|  |  | 
|  | Read-only "cgroup.controllers" file contains a space-separated list of | 
|  | controllers which can be enabled in the cgroup's | 
|  | "cgroup.subtree_control" file. | 
|  |  | 
|  | In the root cgroup, this lists controllers which are not bound to | 
|  | other hierarchies and the content changes as controllers are bound to | 
|  | and unbound from other hierarchies. | 
|  |  | 
|  | In non-root cgroups, the content of this file equals that of the | 
|  | parent's "cgroup.subtree_control" file as only controllers enabled | 
|  | from the parent can be used in its children. | 
|  |  | 
|  |  | 
|  | 3. Structural Constraints | 
|  |  | 
|  | 3-1. Top-down | 
|  |  | 
|  | As it doesn't make sense to nest control of an uncontrolled resource, | 
|  | all non-root "cgroup.subtree_control" files can only contain | 
|  | controllers which are enabled in the parent's "cgroup.subtree_control" | 
|  | file.  A controller can be enabled only if the parent has the | 
|  | controller enabled and a controller can't be disabled if one or more | 
|  | children have it enabled. | 
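
The relationship between "cgroup.controllers", "cgroup.subtree_control"
and the top-down rule can be sketched as a toy model.  The Cgroup class
below and its method names are hypothetical and exist only for this
illustration; the real checks live in cgroup core.

```python
class Cgroup:
    """Toy model of a cgroup on the unified hierarchy."""

    def __init__(self, parent=None, root_controllers=()):
        self.parent = parent
        self.children = []
        self.subtree_control = set()
        # Only meaningful for the root: controllers not bound elsewhere.
        self.root_controllers = set(root_controllers)
        if parent is not None:
            parent.children.append(self)

    @property
    def controllers(self):
        """Model of the read-only cgroup.controllers file."""
        if self.parent is None:
            return set(self.root_controllers)
        # Non-root: equals the parent's cgroup.subtree_control.
        return set(self.parent.subtree_control)

    def can_enable(self, name):
        # A controller can be enabled only if the parent has it enabled.
        return name in self.controllers

    def can_disable(self, name):
        # A controller can't be disabled while any child has it enabled.
        return all(name not in c.subtree_control for c in self.children)
```

With a chain root - A - B, "memory" cannot be enabled in A's
"cgroup.subtree_control" until it is enabled in root's, and root cannot
disable it again while A still has it enabled.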
|  |  | 
|  |  | 
|  | 3-2. No internal tasks | 
|  |  | 
|  | One long-standing issue that cgroup faces is the competition between | 
|  | tasks belonging to the parent cgroup and its children cgroups.  This | 
|  | is inherently nasty as two different types of entities compete and | 
|  | there is no agreed-upon obvious way to handle it.  Different | 
|  | controllers are doing different things. | 
|  |  | 
|  | The cpu controller considers tasks and cgroups as equivalents and maps | 
|  | nice levels to cgroup weights.  This works for some cases but falls | 
|  | flat when children should be allocated specific ratios of CPU cycles | 
|  | and the number of internal tasks fluctuates - the ratios constantly | 
|  | change as the number of competing entities fluctuates.  There also are | 
|  | other issues.  The mapping from nice level to weight isn't obvious or | 
|  | universal, and there are various other knobs which simply aren't | 
|  | available for tasks. | 
|  |  | 
|  | The blkio controller implicitly creates a hidden leaf node for each | 
|  | cgroup to host the tasks.  The hidden leaf has its own copies of all | 
|  | the knobs with "leaf_" prefixed.  While this allows equivalent control | 
|  | over internal tasks, it comes with serious drawbacks.  It always adds an | 
|  | extra layer of nesting which may not be necessary, makes the interface | 
|  | messy and significantly complicates the implementation. | 
|  |  | 
|  | The memory controller currently doesn't have a way to control what | 
|  | happens between internal tasks and child cgroups and the behavior is | 
|  | not clearly defined.  There have been attempts to add ad-hoc behaviors | 
|  | and knobs to tailor the behavior to specific workloads.  Continuing | 
|  | this direction will lead to problems which will be extremely difficult | 
|  | to resolve in the long term. | 
|  |  | 
|  | Multiple controllers struggle with internal tasks and have come up | 
|  | with different ways to deal with them; unfortunately, all the | 
|  | approaches in use now are severely flawed and, furthermore, the widely | 
|  | different behaviors make cgroup as a whole highly inconsistent. | 
|  |  | 
|  | It is clear that this is something which needs to be addressed from | 
|  | cgroup core proper in a uniform way so that controllers don't need to | 
|  | worry about it and cgroup as a whole shows a consistent and logical | 
|  | behavior.  To achieve that, unified hierarchy enforces the following | 
|  | structural constraint: | 
|  |  | 
|  |   Except for the root, only cgroups which don't contain any task may | 
|  |   have controllers enabled in their "cgroup.subtree_control" files. | 
|  |  | 
|  | Combined with other properties, this guarantees that, when a | 
|  | controller is looking at the part of the hierarchy which has it | 
|  | enabled, tasks are always only on the leaves.  This rules out | 
|  | situations where child cgroups compete against internal tasks of the | 
|  | parent. | 
|  |  | 
|  | There are two things to note.  Firstly, the root cgroup is exempt from | 
|  | the restriction.  Root contains tasks and anonymous resource | 
|  | consumption which can't be associated with any other cgroup and | 
|  | requires special treatment from most controllers.  How resource | 
|  | consumption in the root cgroup is governed is up to each controller. | 
|  |  | 
|  | Secondly, the restriction doesn't take effect if there is no enabled | 
|  | controller in the cgroup's "cgroup.subtree_control" file.  This is | 
|  | important as otherwise it wouldn't be possible to create children of a | 
|  | populated cgroup.  To control resource distribution of a cgroup, the | 
|  | cgroup must create children and transfer all its tasks to the children | 
|  | before enabling controllers in its "cgroup.subtree_control" file. | 
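
The order of operations in the last paragraph can be sketched with a
toy model.  The Cgroup class, the InternalTasksError exception and the
push_tasks_down helper are hypothetical names used only for this
illustration, and the sketch models just the constraint stated above,
not the other properties it is combined with.

```python
class InternalTasksError(Exception):
    """Raised when a populated non-root cgroup tries to enable a controller."""


class Cgroup:
    def __init__(self, parent=None):
        self.parent = parent
        self.tasks = set()
        self.children = {}
        self.subtree_control = set()

    def mkdir(self, name):
        # Creating children of a populated cgroup is always allowed.
        child = Cgroup(parent=self)
        self.children[name] = child
        return child

    def enable(self, controller):
        # The root is exempt; any other cgroup must not contain tasks.
        if self.parent is not None and self.tasks:
            raise InternalTasksError("cgroup still contains tasks")
        self.subtree_control.add(controller)


def push_tasks_down(cgroup, leaf_name):
    """Create a child and transfer all tasks of 'cgroup' into it."""
    leaf = cgroup.mkdir(leaf_name)
    leaf.tasks |= cgroup.tasks
    cgroup.tasks.clear()
    return leaf
```

A populated cgroup first calls push_tasks_down() and only then enables
controllers; calling enable() before the tasks have been moved fails in
this model.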
|  |  | 
|  |  | 
|  | 4. Delegation | 
|  |  | 
|  | 4-1. Model of delegation | 
|  |  | 
|  | A cgroup can be delegated to a less privileged user by granting write | 
|  | access of the directory and its "cgroup.procs" file to the user.  Note | 
|  | that the resource control knobs in a given directory concern the | 
|  | resources of the parent and thus must not be delegated along with the | 
|  | directory. | 
|  |  | 
|  | Once delegated, the user can build a sub-hierarchy under the directory, | 
|  | organize processes as it sees fit and further distribute the resources | 
|  | it got from the parent.  The limits and other settings of all resource | 
|  | controllers are hierarchical and regardless of what happens in the | 
|  | delegated sub-hierarchy, nothing can escape the resource restrictions | 
|  | imposed by the parent. | 
|  |  | 
|  | Currently, cgroup doesn't impose any restrictions on the number of | 
|  | cgroups in or nesting depth of a delegated sub-hierarchy; however, | 
|  | this may in the future be limited explicitly. | 
|  |  | 
|  |  | 
|  | 4-2. Common ancestor rule | 
|  |  | 
|  | On the unified hierarchy, to write to a "cgroup.procs" file, in | 
|  | addition to the usual write permission to the file and uid match, the | 
|  | writer must also have write access to the "cgroup.procs" file of the | 
|  | common ancestor of the source and destination cgroups.  This prevents | 
|  | delegatees from smuggling processes across disjoint sub-hierarchies. | 
|  |  | 
|  | Let's say cgroups C0 and C1 have been delegated to user U0 who created | 
|  | C00, C01 under C0 and C10 under C1 as follows. | 
|  |  | 
|  | ~~~~~~~~~~~~~ - C0 - C00 | 
|  | ~ cgroup    ~      \ C01 | 
|  | ~ hierarchy ~ | 
|  | ~~~~~~~~~~~~~ - C1 - C10 | 
|  |  | 
|  | C0 and C1 are separate entities in terms of resource distribution | 
|  | regardless of their relative positions in the hierarchy.  The | 
|  | resources the processes under C0 are entitled to are controlled by | 
|  | C0's ancestors and may be completely different from C1.  It's clear | 
|  | that the intention of delegating C0 to U0 is allowing U0 to organize | 
|  | the processes under C0 and further control the distribution of C0's | 
|  | resources. | 
|  |  | 
|  | On traditional hierarchies, if a task has write access to "tasks" or | 
|  | "cgroup.procs" file of a cgroup and its uid agrees with the target, it | 
|  | can move the target to the cgroup.  In the above example, U0 will not | 
|  | only be able to move processes in each sub-hierarchy but also across | 
|  | the two sub-hierarchies, effectively allowing it to violate the | 
|  | organizational and resource restrictions implied by the hierarchical | 
|  | structure above C0 and C1. | 
|  |  | 
|  | On the unified hierarchy, let's say U0 wants to write the pid of a | 
|  | process which has a matching uid and is currently in C10 into | 
|  | "C00/cgroup.procs".  U0 obviously has write access to the file and | 
|  | migration permission on the process; however, the common ancestor of | 
|  | the source cgroup C10 and the destination cgroup C00 is above the | 
|  | points of delegation and U0 would not have write access to its | 
|  | "cgroup.procs" file, so the write would be denied with -EACCES. | 
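
The check in the example above can be sketched as follows.  This is a
model only: cgroups are represented by hypothetical paths such as
"/C0/C00", the set of cgroups whose "cgroup.procs" file the writer may
write to is passed in explicitly, and the uid match is not modeled.

```python
import posixpath


def common_ancestor(src, dst):
    """Deepest cgroup path that contains both src and dst."""
    return posixpath.commonpath([src, dst])


def may_migrate(writable_procs, src, dst):
    """Model of the common ancestor rule.

    writable_procs -- set of cgroup paths whose cgroup.procs file the
                      writer has write access to
    src, dst       -- current and destination cgroup of the process
    """
    if dst not in writable_procs:
        return False  # the usual write permission on the target file
    # Additionally, the common ancestor's cgroup.procs must be writable.
    return common_ancestor(src, dst) in writable_procs
```

With C0 and C1 delegated to U0, moving a process from C01 to C00 is
allowed because their common ancestor C0 is writable, while moving one
from C10 to C00 is refused because the common ancestor lies above the
points of delegation.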
|  |  | 
|  |  | 
|  | 5. Other Changes | 
|  |  | 
|  | 5-1. [Un]populated Notification | 
|  |  | 
|  | cgroup users often need a way to determine when a cgroup's | 
|  | subhierarchy becomes empty so that it can be cleaned up.  cgroup | 
|  | currently provides release_agent for it; unfortunately, this mechanism | 
|  | is riddled with issues. | 
|  |  | 
|  | - It delivers events by forking and execing a userland binary | 
|  | specified as the release_agent.  This is a long deprecated method of | 
|  | notification delivery.  It's extremely heavy, slow and cumbersome to | 
|  | integrate with larger infrastructure. | 
|  |  | 
|  | - There is a single monitoring point at the root.  There's no way to | 
|  | delegate management of a subtree. | 
|  |  | 
|  | - The event isn't recursive.  It triggers when a cgroup doesn't have | 
|  | any tasks or child cgroups.  Events for internal nodes trigger only | 
|  | after all children are removed.  This again makes it impossible to | 
|  | delegate management of a subtree. | 
|  |  | 
|  | - Events are filtered from the kernel side.  A "notify_on_release" | 
|  | file is used to subscribe to or suppress release events.  This is | 
|  | unnecessarily complicated and probably done this way because event | 
|  | delivery itself was expensive. | 
|  |  | 
|  | Unified hierarchy implements an interface file "cgroup.populated" | 
|  | which can be used to monitor whether the cgroup's subhierarchy has | 
|  | tasks in it or not.  Its value is 0 if there is no task in the cgroup | 
|  | and its descendants; otherwise, 1.  poll and [id]notify events are | 
|  | triggered when the value changes. | 
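
The value of "cgroup.populated" is recursive, unlike the release_agent
event.  A minimal sketch of how the value is defined, using a
hypothetical dictionary representation of the hierarchy rather than
the real interface file:

```python
def populated(cgroup):
    """Return 1 if the cgroup or any descendant has tasks, else 0.

    cgroup -- dict with "tasks" (a collection of pids) and
              "children" (a list of dicts of the same shape)
    """
    if cgroup["tasks"]:
        return 1
    if any(populated(child) for child in cgroup["children"]):
        return 1
    return 0
```

An internal cgroup without tasks of its own still reports 1 as long as
any descendant holds a task, and drops to 0 only once the whole
subhierarchy is empty.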
|  |  | 
|  | This is significantly lighter and simpler and trivially allows | 
|  | delegating management of a subhierarchy - subhierarchy monitoring can | 
|  | block further propagation simply by putting itself or another process | 
|  | in the subhierarchy and can monitor events that it's interested in from | 
|  | there without interfering with monitoring higher in the tree. | 
|  |  | 
|  | In unified hierarchy, the release_agent mechanism is no longer | 
|  | supported and the interface files "release_agent" and | 
|  | "notify_on_release" do not exist. | 
|  |  | 
|  |  | 
|  | 5-2. Other Core Changes | 
|  |  | 
|  | - None of the mount options is allowed. | 
|  |  | 
|  | - remount is disallowed. | 
|  |  | 
|  | - rename(2) is disallowed. | 
|  |  | 
|  | - The "tasks" file is removed.  Everything should be at process | 
|  | granularity.  Use the "cgroup.procs" file instead. | 
|  |  | 
|  | - The "cgroup.procs" file is not sorted.  pids will be unique unless | 
|  | they got recycled in-between reads. | 
|  |  | 
|  | - The "cgroup.clone_children" file is removed. | 
|  |  | 
|  |  | 
|  | 5-3. Per-Controller Changes | 
|  |  | 
|  | 5-3-1. blkio | 
|  |  | 
|  | - blk-throttle becomes properly hierarchical. | 
|  |  | 
|  |  | 
|  | 5-3-2. cpuset | 
|  |  | 
|  | - Tasks are kept in empty cpusets after hotplug and take on the masks | 
|  | of the nearest non-empty ancestor, instead of being moved to it. | 
|  |  | 
|  | - A task can be moved into an empty cpuset, and again it takes on the | 
|  | masks of the nearest non-empty ancestor. | 
|  |  | 
|  |  | 
|  | 5-3-3. memory | 
|  |  | 
|  | - use_hierarchy is on by default and the cgroup file for the flag is | 
|  | not created. | 
|  |  | 
|  | - The original lower boundary, the soft limit, is defined as a limit | 
|  | that is unset by default.  As a result, the set of cgroups that | 
|  | global reclaim prefers is opt-in, rather than opt-out.  The costs | 
|  | for optimizing these mostly negative lookups are so high that the | 
|  | implementation, despite its enormous size, does not even provide the | 
|  | basic desirable behavior.  First off, the soft limit has no | 
|  | hierarchical meaning.  All configured groups are organized in a | 
|  | global rbtree and treated like equal peers, regardless of where they | 
|  | are located in the hierarchy.  This makes subtree delegation | 
|  | impossible.  Second, the soft limit reclaim pass is so aggressive | 
|  | that it not only introduces high allocation latencies into the | 
|  | system, but also impacts system performance due to overreclaim, to | 
|  | the point where the feature becomes self-defeating. | 
|  |  | 
|  | The memory.low boundary on the other hand is a top-down allocated | 
|  | reserve.  A cgroup enjoys reclaim protection when it and all its | 
|  | ancestors are below their low boundaries, which makes delegation of | 
|  | subtrees possible.  Secondly, new cgroups have no reserve by | 
|  | default and in the common case most cgroups are eligible for the | 
|  | preferred reclaim pass.  This allows the new low boundary to be | 
|  | efficiently implemented with just a minor addition to the generic | 
|  | reclaim code, without the need for out-of-band data structures and | 
|  | reclaim passes.  Because the generic reclaim code considers all | 
|  | cgroups except for the ones running low in the preferred first | 
|  | reclaim pass, overreclaim of individual groups is eliminated as | 
|  | well, resulting in much better overall workload performance. | 
|  |  | 
|  | - The original high boundary, the hard limit, is defined as a strict | 
|  | limit that can not budge, even if the OOM killer has to be called. | 
|  | But this generally goes against the goal of making the most out of | 
|  | the available memory.  The memory consumption of workloads varies | 
|  | during runtime, and that requires users to overcommit.  But doing | 
|  | that with a strict upper limit requires either a fairly accurate | 
|  | prediction of the working set size or adding slack to the limit. | 
|  | Since working set size estimation is hard and error prone, and | 
|  | getting it wrong results in OOM kills, most users tend to err on the | 
|  | side of a looser limit and end up wasting precious resources. | 
|  |  | 
|  | The memory.high boundary on the other hand can be set much more | 
|  | conservatively.  When hit, it throttles allocations by forcing them | 
|  | into direct reclaim to work off the excess, but it never invokes the | 
|  | OOM killer.  As a result, a high boundary that is chosen too | 
|  | aggressively will not terminate the processes, but instead it will | 
|  | lead to gradual performance degradation.  The user can monitor this | 
|  | and make corrections until the minimal memory footprint that still | 
|  | gives acceptable performance is found. | 
|  |  | 
|  | In extreme cases, with many concurrent allocations and a complete | 
|  | breakdown of reclaim progress within the group, the high boundary | 
|  | can be exceeded.  But even then it's mostly better to satisfy the | 
|  | allocation from the slack available in other groups or the rest of | 
|  | the system than killing the group.  Otherwise, memory.max is there | 
|  | to limit this type of spillover and ultimately contain buggy or even | 
|  | malicious applications. | 
|  |  | 
|  | - The original control file names are unwieldy and inconsistent in | 
|  | many different ways.  For example, the upper boundary hit count is | 
|  | exported in the memory.failcnt file, but an OOM event count has to | 
|  | be manually counted by listening to memory.oom_control events, and | 
|  | lower boundary / soft limit events have to be counted by first | 
|  | setting a threshold for that value and then counting those events. | 
|  | Also, usage and limit files encode their units in the filename. | 
|  | That makes the filenames very long, even though this is not | 
|  | information that a user needs to be reminded of every time they type | 
|  | out those names. | 
|  |  | 
|  | To address these naming issues, as well as to signal clearly that | 
|  | the new interface carries a new configuration model, the naming | 
|  | conventions in it necessarily differ from the old interface. | 
|  |  | 
|  | - The original limit files indicate the state of an unset limit with a | 
|  | Very High Number, and a configured limit can be unset by echoing -1 | 
|  | into those files.  But that very high number is implementation and | 
|  | architecture dependent and not very descriptive.  And while -1 can | 
|  | be understood as an underflow into the highest possible value, -2 or | 
|  | -10M etc. do not work, so it's not consistent. | 
|  |  | 
|  | memory.low, memory.high, and memory.max will use the string "max" to | 
|  | indicate and set the highest possible value. | 
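
Two of the conventions above, the "max" string and the hierarchical
memory.low protection, can be sketched in userspace terms.  The helper
names are hypothetical, representing "max" as infinity and rejecting
negative numbers are assumptions of this sketch, and none of it is the
kernel's implementation.

```python
import math


def parse_limit(text):
    """Parse the content of memory.low, memory.high or memory.max."""
    text = text.strip()
    if text == "max":
        return math.inf  # highest possible value, i.e. no limit
    value = int(text)
    if value < 0:
        raise ValueError("negative values are not accepted in this sketch")
    return value


def format_limit(value):
    """Inverse of parse_limit()."""
    return "max" if value == math.inf else str(value)


def has_low_protection(chain):
    """Model of the memory.low rule.

    chain -- (usage, low) pairs for a cgroup and each of its ancestors;
             protection applies only if every one is below its boundary
    """
    return all(usage < low for usage, low in chain)
```

A new cgroup, modeled with a low boundary of 0, is never protected, and
one ancestor above its boundary is enough to remove protection from
everything underneath it.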
|  |  | 
|  |  | 
|  | 6. Planned Changes | 
|  |  | 
|  | 6-1. CAP for resource control | 
|  |  | 
|  | Unified hierarchy will require one of the capabilities(7), which is | 
|  | yet to be decided, for all resource control related knobs.  Process | 
|  | organization operations - creation of sub-cgroups and migration of | 
|  | processes in sub-hierarchies - may be delegated by changing the | 
|  | ownership and/or permissions on the cgroup directory and | 
|  | "cgroup.procs" interface file; however, all operations which affect | 
|  | resource control - writes to a "cgroup.subtree_control" file or any | 
|  | controller-specific knobs - will require an explicit CAP privilege. | 
|  |  | 
|  | This, in part, is to prevent the cgroup interface from being | 
|  | inadvertently promoted to a programmable API used by non-privileged | 
|  | binaries.  cgroup exposes various aspects of the system in ways which | 
|  | aren't properly abstracted for direct consumption by regular programs. | 
|  | This is an administration interface much closer to sysctl knobs than | 
|  | system calls.  Even the basic access model, being filesystem path | 
|  | based, isn't suitable for direct consumption.  There's no way to | 
|  | access "my cgroup" in a race-free way or make multiple operations | 
|  | atomic against migration to another cgroup. | 
|  |  | 
|  | Another aspect is that, for better or for worse, the cgroup interface | 
|  | goes through far less scrutiny than regular interfaces for | 
|  | unprivileged userland.  The upside is that cgroup is able to expose | 
|  | useful features which may not be suitable for general consumption in a | 
|  | reasonable time frame.  It provides a relatively short path between | 
|  | internal details and userland-visible interface.  Of course, this | 
|  | shortcut comes with high risk.  We go through what we go through for | 
|  | general kernel APIs for good reasons.  It may end up leaking internal | 
|  | details in a way which can exert significant pain by locking the | 
|  | kernel into a contract that can't be maintained in a reasonable | 
|  | manner. | 
|  |  | 
|  | Also, due to the specific nature, cgroup and its controllers don't | 
|  | tend to attract attention from a wide scope of developers.  cgroup's | 
|  | short history is already fraught with severely mis-designed | 
|  | interfaces, unnecessary commitments to and exposure of internal | 
|  | details, and broken and dangerous implementations of various features. | 
|  |  | 
|  | Keeping cgroup as an administration interface is both advantageous for | 
|  | its role and imperative given its nature.  Some of the cgroup features | 
|  | may make sense for unprivileged access.  If deemed justified, those | 
|  | must be further abstracted and implemented as a different interface, | 
|  | be it a system call or process-private filesystem, and survive through | 
|  | the scrutiny that any interface for general consumption is required to | 
|  | go through. | 
|  |  | 
|  | Requiring CAP is not a complete solution but should serve as a | 
|  | significant deterrent against spraying cgroup usages in non-privileged | 
|  | programs. |