# Users, Groups, UIDs and GIDs on `systemd` systems

Here's a summary of the requirements `systemd` (and Linux) make on UID/GID
assignments and their ranges.

Note that while in theory UIDs and GIDs are orthogonal concepts, in practice
they really aren't. With that in mind, when we discuss UIDs below it should be
assumed that whatever we say about UIDs applies to GIDs in mostly the same way,
and that the special assignments and ranges for UIDs are mostly valid for GIDs
too.

## Special Linux UIDs

In theory, the range of the C type `uid_t` is 32bit wide on Linux,
i.e. 0…4294967295. However, four UIDs are special on Linux:

1. 0 → The `root` super-user

2. 65534 → The `nobody` UID, also called the "overflow" UID or similar. It's
   where various subsystems map unmappable users to, for example NFS or user
   namespacing. (The latter can be changed with a sysctl during runtime, but
   that's not supported on `systemd`. If you do change it you void your
   warranty.) Because Fedora is a bit confused the `nobody` user is called
   `nfsnobody` there (and they have a different `nobody` user at UID 99). I
   hope this will be corrected eventually though. (Also, some distributions
   call the `nobody` group `nogroup`. I wish they didn't.)

3. 4294967295, aka "32bit `(uid_t) -1`" → This UID is not a valid user ID, as
   setresuid(), chown() and friends treat -1 as a special request to not change
   the UID of the process/file. This UID is hence not available for assignment
   to users in the user database.

4. 65535, aka "16bit `(uid_t) -1`" → Once upon a time `uid_t` used to be 16bit, and
   programs compiled for that would hence assume that `(uid_t) -1` is 65535. This
   UID is hence not usable either.
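
The four special values above can be captured in a small check. The following Python sketch is only an illustration: the constant names and `uid_is_valid()` are made up for this example and are not part of `systemd` or glibc; only the numeric values come from the list above.

```python
# Hypothetical names; the values are the special UIDs listed above.
UID_ROOT = 0
UID_NOBODY = 65534
UID_INVALID_16 = 0xFFFF        # 16bit (uid_t) -1
UID_INVALID_32 = 0xFFFFFFFF    # 32bit (uid_t) -1


def uid_is_valid(uid: int) -> bool:
    """Return True if uid fits into 32 bits and is neither "-1" value."""
    if not 0 <= uid <= 0xFFFFFFFF:
        return False
    return uid not in (UID_INVALID_16, UID_INVALID_32)
```

Note that `root` and `nobody` are valid UIDs under this check; they are special in meaning, not unusable.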

The `nss-systemd` glibc NSS module will synthesize user database records for
the UIDs 0 and 65534 if the system user database doesn't list them. This means
that any system where this module is enabled works to some minimal level
without `/etc/passwd`.

## Special Distribution UID ranges

Distributions generally split the available UID range in two:

1. 1…999 → System users. These are users that do not map to actual "human"
   users, but are used as security identities for system daemons, to implement
   privilege separation and run system daemons with minimal privileges.

2. 1000…65533 and 65536…4294967294 → Everything else, i.e. regular (human) users.

Note that most distributions allow changing the boundary between system and
regular users, even at runtime as part of user configuration. Moreover, some
older systems placed the boundary at 499/500, or even 99/100. In `systemd`, the
boundary is configurable only at compile time, as this should be a
decision for distribution builders, not for users. Moreover, we strongly
discourage downstreams from changing the boundary from the upstream default of
999/1000.

Also note that programs such as `adduser` tend to allocate from a subset of the
available regular user range only, usually 1000…60000. This subset is usually
user-configurable, too.

Note that systemd requires that system users and groups are resolvable without
networking available — a requirement that is not made for regular users. This
means regular users may be stored in remote LDAP or NIS databases, but system
users may not (except when there's a consistent local cache kept, that is
available during earliest boot, including in the initial RAM disk).

## Special `systemd` GIDs

`systemd` defines no special UIDs beyond what Linux already defines (see
above). However, it does define some special group/GID assignments, which are
primarily used for `systemd-udevd`'s device management. The precise list of the
currently defined groups is found in this `sysusers.d` snippet:
[basic.conf](https://raw.githubusercontent.com/systemd/systemd/master/sysusers.d/basic.conf.in)

It's strongly recommended that downstream distributions include these groups in
their default group databases.

Note that the actual GID numbers assigned to these groups do not have to be
the same on every system. There's one exception however: the `tty`
group must have the GID 5. That's because it must be encoded in the `devpts`
mount parameters during earliest boot, at a time when NSS lookups are not
possible. (Note that the actual GID can be changed during `systemd` build time,
but downstreams are strongly advised against doing that.)

## Special `systemd` UID ranges

`systemd` defines a number of special UID ranges:

1. 61184…65519 → UIDs for dynamic users are allocated from this range (see the
   `DynamicUser=` documentation in
   [`systemd.exec(5)`](https://www.freedesktop.org/software/systemd/man/systemd.exec.html)). This
   range has been chosen so that it is below the 16bit boundary (i.e. below
   65535), in order to provide compatibility with container environments that
   assign a 64K range of UIDs to containers using user namespacing. This range
   is above the 60000 boundary, so that its allocations are unlikely to be
   affected by `adduser` allocations (see above). And we leave some room
   upwards for other purposes. (And if you wonder why precisely these numbers:
   if you write them in hexadecimal, they might make more sense: 0xEF00 and
   0xFFEF). The `nss-systemd` module will synthesize user records implicitly
   for all currently allocated dynamic users from this range. Thus, NSS-based
   user record resolving works correctly without those users being in
   `/etc/passwd`.

2. 524288…1879048191 → UID range for `systemd-nspawn`'s automatic allocation of
   per-container UID ranges. When the `--private-users=pick` switch is used (or
   `-U`) then it will automatically find a so far unused 16bit subrange of this
   range and assign it to the container. The range is picked so that the upper
   16bit of the 32bit UIDs are constant for all users of the container, while
   the lower 16bit directly encode the 65536 UIDs assigned to the
   container. This mode of allocation means that the upper 16bit of any UID
   assigned to a container are kind of a "container ID", while the lower 16bit
   directly expose the container's own UID numbers. If you wonder why precisely
   these numbers, consider them in hexadecimal: 0x00080000…0x6FFFFFFF. This
   range is above the 16bit boundary. Moreover it's below the 31bit boundary,
   as some broken code (specifically: the kernel's `devpts` file system)
   erroneously considers UIDs signed integers, and hence can't deal with values
   above 2^31. The `nss-mymachines` glibc NSS module will synthesize user
   database records for all UIDs assigned to a running container from this
   range.
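
To make the hexadecimal reasoning above concrete, here is a small Python sketch. The constant and function names are hypothetical and chosen for this example only; the numbers are the range boundaries quoted above.

```python
# Range boundaries from the list above; the names are illustrative.
DYNAMIC_UID_MIN, DYNAMIC_UID_MAX = 61184, 65519
CONTAINER_UID_MIN, CONTAINER_UID_MAX = 524288, 1879048191

# Number of 64K-sized, 64K-aligned blocks that fit into the container range.
CONTAINER_SLOTS = (CONTAINER_UID_MAX + 1 - CONTAINER_UID_MIN) // 0x10000


def is_dynamic_uid(uid: int) -> bool:
    """True if uid lies in the range used for dynamic users."""
    return DYNAMIC_UID_MIN <= uid <= DYNAMIC_UID_MAX


def is_container_uid(uid: int) -> bool:
    """True if uid lies in the range used for container UID blocks."""
    return CONTAINER_UID_MIN <= uid <= CONTAINER_UID_MAX


if __name__ == "__main__":
    # Print the boundaries in hexadecimal to show the pattern.
    print(hex(DYNAMIC_UID_MIN), hex(DYNAMIC_UID_MAX))
    print(hex(CONTAINER_UID_MIN), hex(CONTAINER_UID_MAX))
    print(CONTAINER_SLOTS, "container slots")
```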

Note for both allocation ranges: when a UID allocation takes place NSS is
checked for collisions first, and a different UID is picked if an entry is
found. Thus, the user database is used as a synchronization mechanism to ensure
exclusive ownership of UIDs and UID ranges. To ensure compatibility with other
subsystems allocating from the same ranges it is hence essential that they
ensure that whatever they pick shows up in the user/group databases, either by
providing an NSS module, or by adding entries directly to `/etc/passwd` and
`/etc/group`. Note that, for performance reasons, `systemd-nspawn` will only
do an NSS check for the first UID of the range it allocates, not all 65536 of
them. Also note that while the allocation logic is operating, the glibc
`lckpwdf()` user database lock is taken, in order to make this logic race-free.
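
The collision check described above might be sketched roughly as follows. This is not `systemd-nspawn`'s actual implementation: the function name, the random candidate selection, and the `is_taken` callback (standing in for the NSS lookup of the first UID) are all assumptions made for illustration, and the `lckpwdf()` locking is omitted.

```python
import random
from typing import Callable, Optional

CONTAINER_BASE_MIN = 0x00080000   # 524288
CONTAINER_BASE_MAX = 0x6FFF0000   # 1878982656


def pick_container_base(is_taken: Callable[[int], bool],
                        attempts: int = 100,
                        rng: Optional[random.Random] = None) -> Optional[int]:
    """Try to find a free 64K-aligned base UID.

    is_taken(base) is a hypothetical callback that reports whether the
    user database already has an entry for the first UID of the block.
    Returns None if no free base was found within the given attempts.
    """
    rng = rng or random.Random()
    for _ in range(attempts):
        # Choose the upper 16 bits at random; the lower 16 bits stay zero.
        slot = rng.randint(CONTAINER_BASE_MIN >> 16, CONTAINER_BASE_MAX >> 16)
        base = slot << 16
        if not is_taken(base):
            return base
    return None
```

Only the first UID of each candidate block is checked, mirroring the behaviour described above.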

## Figuring out the system's UID boundaries

The most important boundaries of the local system may be queried with
`pkg-config`:

```
$ pkg-config --variable=systemuidmax systemd
999
$ pkg-config --variable=dynamicuidmin systemd
61184
$ pkg-config --variable=dynamicuidmax systemd
65519
$ pkg-config --variable=containeruidbasemin systemd
524288
$ pkg-config --variable=containeruidbasemax systemd
1878982656
```

(Note that the latter encodes the maximum UID *base* `systemd-nspawn` might
pick — given that 64K UIDs are assigned to each container according to this
allocation logic, the maximum UID used for this range is hence
1878982656+65535=1879048191.)
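
For scripts that need these values, the `pkg-config` calls shown above can be wrapped. The following Python sketch makes a few assumptions: the helper names are invented for this example, and `query_boundary()` simply returns `None` when `pkg-config` or the `systemd` package metadata is not available.

```python
import subprocess
from typing import Optional


def parse_boundary(output: str) -> Optional[int]:
    """Parse pkg-config output into an integer, or None if it is not a number."""
    text = output.strip()
    return int(text) if text.isdigit() else None


def query_boundary(variable: str) -> Optional[int]:
    """Run `pkg-config --variable=<variable> systemd` and parse the result."""
    try:
        result = subprocess.run(
            ["pkg-config", f"--variable={variable}", "systemd"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_boundary(result.stdout)


def container_uid_max(base_max: int) -> int:
    """Highest UID in use when the last 64K block starts at base_max."""
    return base_max + 0xFFFF
```

For example, `container_uid_max(1878982656)` yields 1879048191, matching the arithmetic above.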

Note that systemd does not make any of these values runtime-configurable. All
these boundaries are chosen during build time. That said, the system UID/GID
boundary is traditionally configured in `/etc/login.defs`, though systemd won't
look there during runtime.

## Considerations for container managers

If you hack on a container manager, and wonder how and how many UIDs best to
assign to your containers, here are a few recommendations:

1. Definitely, don't assign fewer than 65536 UIDs/GIDs. After all the `nobody`
user has magic properties, and hence should be available in your container, and
given that it's assigned the UID 65534, you should really cover the full 16bit
range in your container. Note that systemd will, as mentioned, synthesize
user records for the `nobody` user, and assumes its availability in various
other parts of its codebase, too, hence assigning fewer users means you lose
compatibility with running systemd code inside your container. And most likely
other packages make similar assumptions.

2. While it's fine to assign more than 65536 UIDs/GIDs to a container, there's
most likely not much value in doing so, as Linux distributions won't use the
higher ranges by default (as mentioned neither `adduser` nor `systemd`'s
dynamic user concept allocate from above the 16bit range). Unless you actively
care about nested containers, it's hence probably a good idea to allocate exactly
65536 UIDs per container, and neither less nor more. A pretty side-effect is
that by doing so, you expose the same number of UIDs per container as Linux 2.2
supported for the whole system, back in the day.

3. Consider allocating UID ranges for containers so that the first UID you
assign has the lower 16bits all set to zero. That way, the upper 16bits become
a container ID of some kind, while the lower 16bits directly encode the
internal container UID. This is the way `systemd-nspawn` allocates UID ranges
(see above). Following this allocation logic ensures best compatibility with
`systemd-nspawn` and all other container managers following the scheme, as it
is sufficient then to check NSS for the first UID you pick regarding conflicts,
as that's what they do, too. Moreover, it makes `chown()`ing container file
system trees nicely robust to interruptions: as the external UID encodes the
internal UID in a fixed way, it's very easy to adjust the container's base UID
without the need to know the original base UID: to change the container base,
just mask away the upper 16bit, and insert the upper 16bit of the new container
base instead. Here are the easy conversions to derive the internal UID, the
external UID, and the container base UID from each other:

    ```
    INTERNAL_UID = EXTERNAL_UID & 0x0000FFFF
    CONTAINER_BASE_UID = EXTERNAL_UID & 0xFFFF0000
    EXTERNAL_UID = INTERNAL_UID | CONTAINER_BASE_UID
    ```

4. When picking a UID range for containers, make sure to check NSS first, with
a simple `getpwuid()` call: if there's already a user record for the first UID
you want to pick, then it's already in use: pick a different one. Wrap that
call in a `lckpwdf()` + `ulckpwdf()` pair, to make allocation
race-free. Provide an NSS module that makes all UIDs you end up taking show up
in the user database, and make sure that the NSS module returns up-to-date
information before you release the lock, so that other system components can
safely use the NSS user database as allocation check, too. Note that if you
follow this scheme no changes to `/etc/passwd` need to be made, thus minimizing
the artifacts the container manager persistently leaves in the system.
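
Recommendations 3 and 4 can be illustrated together in Python. This is a sketch under stated assumptions, not a reference implementation: all function names are invented for this example, the `lookup` parameter exists only so the check can be exercised without a real user database, and the `lckpwdf()`/`ulckpwdf()` pair that should surround the check is left out.

```python
import pwd
from typing import Callable

UID_MASK = 0x0000FFFF    # lower 16 bits: UID as seen inside the container
BASE_MASK = 0xFFFF0000   # upper 16 bits: the container's base UID


def to_internal(external_uid: int) -> int:
    """INTERNAL_UID = EXTERNAL_UID & 0x0000FFFF"""
    return external_uid & UID_MASK


def to_base(external_uid: int) -> int:
    """CONTAINER_BASE_UID = EXTERNAL_UID & 0xFFFF0000"""
    return external_uid & BASE_MASK


def to_external(internal_uid: int, base_uid: int) -> int:
    """EXTERNAL_UID = INTERNAL_UID | CONTAINER_BASE_UID"""
    if internal_uid & BASE_MASK or base_uid & UID_MASK:
        raise ValueError("internal UID or base UID has bits in the wrong half")
    return internal_uid | base_uid


def rebase(external_uid: int, new_base_uid: int) -> int:
    """Move a UID to a new container base without knowing the old base."""
    return to_external(to_internal(external_uid), new_base_uid)


def uid_in_use(uid: int, lookup: Callable[[int], object] = pwd.getpwuid) -> bool:
    """True if the user database has a record for uid.

    A real allocator would hold the lckpwdf() lock around this check.
    """
    try:
        lookup(uid)
    except KeyError:
        return False
    return True
```

Because `rebase()` depends only on the external UID and the new base, an interrupted recursive `chown()` can simply be restarted from the top.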

## Summary

|               UID/GID | Purpose               | Defined By    | Listed in                     |
|-----------------------|-----------------------|---------------|-------------------------------|
|                     0 | `root` user           | Linux         | `/etc/passwd` + `nss-systemd` |
|                   1…4 | System users          | Distributions | `/etc/passwd`                 |
|                     5 | `tty` group           | `systemd`     | `/etc/group`                  |
|                 6…999 | System users          | Distributions | `/etc/passwd`                 |
|            1000…60000 | Regular users         | Distributions | `/etc/passwd` + LDAP/NIS/…    |
|           60001…61183 | Unused                |               |                               |
|           61184…65519 | Dynamic service users | `systemd`     | `nss-systemd`                 |
|           65520…65533 | Unused                |               |                               |
|                 65534 | `nobody` user         | Linux         | `/etc/passwd` + `nss-systemd` |
|                 65535 | 16bit `(uid_t) -1`    | Linux         |                               |
|          65536…524287 | Unused                |               |                               |
|     524288…1879048191 | Container UID ranges  | `systemd`     | `nss-mymachines`              |
| 1879048192…4294967294 | Unused                |               |                               |
|            4294967295 | 32bit `(uid_t) -1`    | Linux         |                               |

Note that "Unused" in the table above doesn't mean that these ranges are
really unused. It just means that these ranges have no well-established
pre-defined purposes between Linux, generic low-level distributions and
`systemd`. There might very well be other packages that allocate from these
ranges.
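
The table above can also be expressed as a lookup. The Python sketch below is illustrative only: `classify_uid()` and the label strings are invented for this example, the boundaries assume the upstream defaults shown in the table, and the `tty` row is folded into the system range because it concerns a GID rather than a UID.

```python
from bisect import bisect_right

# (first UID of the range, label); each range ends where the next begins.
_RANGES = [
    (0, "root user"),
    (1, "system users"),
    (1000, "regular users"),
    (60001, "unused"),
    (61184, "dynamic service users"),
    (65520, "unused"),
    (65534, "nobody user"),
    (65535, "16bit (uid_t) -1"),
    (65536, "unused"),
    (524288, "container UID ranges"),
    (1879048192, "unused"),
    (4294967295, "32bit (uid_t) -1"),
]
_STARTS = [start for start, _ in _RANGES]


def classify_uid(uid: int) -> str:
    """Return the label of the table row that contains uid."""
    if not 0 <= uid <= 0xFFFFFFFF:
        raise ValueError("UID does not fit into 32 bits")
    # bisect_right finds the first range starting after uid; step back one.
    return _RANGES[bisect_right(_STARTS, uid) - 1][1]
```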