static-keys

Rust CI status Crates.io Version docs.rs

Static key is a mechanism used by Linux kernel to speed up checks of seldomly changed features. We brought it to Rust userland applications for Linux, macOS and Windows! (Currently nightly Rust required. For reasons, see FAQ).

Currently CI-tested platforms:

  • Linux

    • x86_64-unknown-linux-gnu
    • x86_64-unknown-linux-musl
    • i686-unknown-linux-gnu
    • aarch64-unknown-linux-gnu
    • riscv64gc-unknown-linux-gnu
    • loongarch64-unknown-linux-gnu
  • macOS

    • aarch64-apple-darwin
  • Windows

    • x86_64-pc-windows-msvc
    • i686-pc-windows-msvc

Note that when using cross-rs to build loongarch64-unknown-linux-gnu target, you should use latest cross-rs avaiable on GitHub. See Evian-Zhang/static-keys#4 for more details.

For more comprehensive explanations and FAQs, you can refer to GitHub Pages (中文版文档).

Motivation

It's a common practice for modern applications to be configurable, either by CLI options or config files. Those values controlled by configuration flags are usually not changed after application initialization, and are frequently accessed during the whole application lifetime.

let flag = CommandlineArgs::parse();
loop {
    if flag {
        do_something();
    }
    do_common_routines();
}

Although flag will not be modified after application initialization, the if-check still happens in each round, and in x86-64, it may be compiled to the following test-jnz instructions.

    test    eax, eax           ; Check whether eax register is 0
    jnz     do_something       ; If not zero, jump to do_something
do_common_routines:
    ; Do common routines
    ret
do_something:
    ; Do something
    jmp     do_common_routines ; Jump to do_common_routines

Although the if-check is just test-jnz instructions, it can still be speedup. What about making the check just a jmp (skip over the do_something branch) or nop (always do_something)? This is what static keys do. To put it simply, we modify the instruction at runtime. After getting the flag passed from commandline, we dynamically modify the if flag {} check to be a jmp or nop according to the flag value.

For example, if user-specified flag is false, the assembled instructions will be dynamically modified to the following nop instruction.

    nop     DWORD PTR [rax+rax*1+0x0]
do_common_routines:
    ; Do common routines
    ret
do_something:
    ; Do something
    jmp     do_common_routines

If flag is true, then we will dynamically modify the instruction to an unconditional jump instruction:

    jmp     do_something
do_common_routines:
    ; Do common routines
    ret
do_something:
    ; Do something
    jmp     do_common_routines

There is no more test and conditional jumps, just a nop (which means this instruction does nothing) or jmp.

Although replacing a test-jnz pair to nop may be minor improvement, however, as documented in linux kernel, if the configuration check involves global variables, this replacement can decrease memory cache pressure. And in server applications, such configuration may be shared between multiple threads in Arcs, which has much more overhead than just nop or jmp.

Usage

To use this crate, currently nightly Rust is required. And in the crate root top, you should declare usage of unstable feature asm_goto.

#![allow(unused)]
#![feature(asm_goto)]
fn main() {
}

First, add this crate to your Cargo.toml:

[dependencies]
static-keys = "0.6"

At the beginning of main function, you should invoke static_keys::global_init to initialize.

fn main() {
    static_keys::global_init();
    // Do other things...
}

Then you should define a static key to hold the value affected by user-controlled flag, and enable or disable it according to the user passed flag.

#![allow(unused)]
fn main() {
use static_keys::define_static_key_false;
struct CommandlineArgs {}
impl CommandlineArgs { fn parse() -> bool { true } }
// FLAG_STATIC_KEY is defined with initial value `false`
define_static_key_false!(FLAG_STATIC_KEY);

fn application_initialize() {
    let flag = CommandlineArgs::parse();
    if flag {
        unsafe {
            FLAG_STATIC_KEY.enable();
        }
    }
}
}

Note that you can enable or disable the static key any number of times at any time. And more importantly, it is very dangerous if you modify a static key in a multi-threads environment. Always spawn threads after you complete the modification of such static keys. And to make it more clear, it is absolutely safe to use this static key in multi-threads environment as below. The modification of static keys may be less efficient, while since the static keys are used to seldomly changed features, the modifications rarely take place, so the inefficiency does not matter. See FAQ for more explanation.

After the definition, you can use this static key at if-check as usual (you can see here and here to know more about the likely-unlikely API semantics). A static key can be used at multiple if-checks. If the static key is modified, all locations using this static key will be modified to jmp or nop accordingly.

#![allow(unused)]
fn main() {
#![feature(asm_goto)]
use static_keys::{define_static_key_false, static_branch_unlikely};
struct CommandlineArgs {}
impl CommandlineArgs { fn parse() -> bool { true } }
fn do_something() {}
fn do_common_routines() {}
define_static_key_false!(FLAG_STATIC_KEY);
fn run() {
    loop {
        if static_branch_unlikely!(FLAG_STATIC_KEY) {
            do_something();
        }
        do_common_routines();
    }
}
}

References

Internal Design

As described in the introduction, the workflow of static keys is:

  1. Globally initialize related structures.
  2. Define a static key.
  3. Modify the static key according to user-specified values.
  4. Use the static key at the if-check.

In this section, we will use three terms to make things more clear:

  • Static key

    A static variable storing information to control the enabling of static branches.

  • Jump entry

    A static variable storing information about static branches. This is used to locate the static branch.

  • Static branch

    A if-check which utilizes static keys.

Simplified Logics

In simplified logics, we can think of the static key and jump entry as the following structures:

struct StaticKey {
    enabled: bool,
    jump_entries: Vec<JumpEntry>,
}

struct JumpEntry {
    code: &'static Instruction,
}

When we modify a static key, the following things happen:

  1. The static key's enable field is modified
  2. We find all jump entries associated with this static key by jump_entries field
  3. For each jump entry, we locate the static branch by static_branch field
  4. We modify the static branch to nop/jmp according to the enable value

And when we use a static key at if-check, we just push another jump entry into the static key, which records the location of current if-check.

After understanding the simplified logics, we can then add some more comprehensive supplements to the simplified logic:

  • Jump entry location.
  • Static branch modification content.
  • Static branch modification rule.
  • Static branch modification approach.

Jump Entry Location

As described in Usage, we can use one static key at multiple if checks. As a result, one static key may be associated with multiple jump entries. However, we cannot construct a compile-time vector in ad-hoc: we cannot define a static vector, and push to this vector at compile time across the crate. As a result, we must store the jump entries in the generated binary, and construct the static key's jump entries at run time to collect associated jump entries.

In practice, we store these jump entries in an individual section in generated binary. The name of such section differs on each OS. For example, in Linux ELF, this section is named __static_keys.

Then at runtime, when initializing, we will collect jump entries to each static key. However, as the jump entries have already in a section loaded into memory, we don't want to double the memory usage to push those jump entries content into the static key's vector.

The solution is to make the jump_entries field of StaticKey a pointer instead of vector. It can just point to the jump entries in the individual section, thus decrease the memory usage. To do so, we then must sort the jump entries in such section to make sure jump entries associated with same static key are adjacent to each other, and then the jump_entries field can point to the first jump entry which associated with the static key.

To make the sort work, we should add another field to the JumpEntry: the address of static key. Then the sort can conduct according to static key address. Note that in implementaion, such addresses are all relative due to ASLR.

As a result, now the structure should be written as

struct StaticKey {
    enabled: bool,
    jump_entries: *const JumpEntry,
}

struct JumpEntry {
    code: &'static Instruction,
    /// Relative address to static key
    key: usize,
}

The jump_entries field is null at beginning, and when initializing, this field is updated to point to the first jump entry which asscoiated with this static key.

Static Branch Modification Content

When modifying a static branch, we may modify nop to jmp, or jmp to nop. In most architectures, the nop instruction can be of many byte length. For example, in x86-64, since the jmp is usually 5-byte long, we select a 5-byte nop to do the replacement. This can be used to make sure that we do not mess up with the following instruction.

However, when modifying nop to jmp, which target should be jumped to? This cannot be deduced trivially. As a result, we need another field in JumpEntry to record the address of jump target:

#![allow(unused)]
fn main() {
struct JumpEntry {
    code: &'static Instruction,
    /// Relative address to jump target
    target: usize,
    key: usize,
}
}

To construct such jump entry at static branch, we use the following inline assembly instruction (take x86-64 for example). When using static_branch_likely! and static_branch_unlikely!, the following code snippet will be generated (details may be different).

'my_label {
    ::core::arch::asm!(
        r#"
        2:
        .byte 0x0f,0x1f,0x44,0x00,0x00
        .pushsection __static_keys, "awR"
        .balign 8
        .quad 2b - .
        .quad {0} - .
        .quad {1} + {2} - .
        .popsection
        "#
        label {
            break 'my_label false;
        },
        sym MY_STATIC_KEY,
        const true as usize,
    );
    break 'my_label true;
}

It seems complicated, and let me break it down to explain.

Assembly part

The first line 2: indicate an assembly label, which used to mark the location of current instruction: 0x0f, 0x1f, 0x44, 0x00, 0x00, which is a 5-byte nop instruction.

Then we use .pushsection and .popsection pair to switch to another section (current section is .text, which is used to record instructions), which is used to store the jump entries.

Inside the new section, we use three .quad to define three 8-byte values, which corresponds to three fields of JumpEntry struct. The first 8-byte value is 2b - ., where 2b indicates the nearest label with name 2, which is the address of nop instruction we just defined. The . represents current location, which is the address of this 8-byte value. Using 2b - . to indicate a relative address to the nop instruction, which is the code field of JumpEntry.

The second 8-byte value is {0} - ., where {0} indicates the first argument in this inline assembly, which is label { break 'my_label false; }. This is the target address of jmp instruction, which corresponds to the target field of JumpEntry. This will be explained later.

The third 8-byte value is {1} + {2} - ., which stores information about static key and initial status of static branch (note that static key is always 8-byte aligned, so the last byte of its address is always 0x00, which allows us to use this to record additional information). The initial status will be explained later as well.

By conducting this inline assembly, a jump entry will be generated in "__static_keys" section at compile time.

Jump label part

Since the inline assembly does not affect the control flow, let's simply the code snippet to see more clear about jump label part:

'my_label {
    // Some inline-assembly
    break 'my_label true;
}

This will be treated as a true value expression by Rust compiler. Since we use the static_branch_likely! and static_branch_unlikely! in the if-check, the if-check then become

if true {
    do_a();
} else {
    do_b();
}
do_c();

As a result, the Rust compiler will optimize the instruction to be

nop        ; 0x0f,0x1f,0x44,0x00,0x00
call do_a  ; do_a()

However, do_b() will not be optimized out: there is an argument in the inline assembly which reference it --- the label { break 'my_label false; }. As described above, this argument is used as an address to the statement break 'my_label false;. When fitting this statement into the if-check, it become a false condition. As a result, this statement is translated to a call to do_b(), which this call is never executed in the static control flow. To make it more clear, let's see what is the generated assembly look like:

    nop           ; 0x0f,0x1f,0x44,0x00,0x00
    call    do_a  ; do_a()
DO_C:
    call    do_c  ; do_c()
    ret           ; End of this function
DO_B:
    call    do_b  ; do_b()
    jmp     DO_C  ; goto DO_C

The basic block at DO_B will never be executed in the static control flow, while we do pass its address to a jump entry stored in the individual section.

When we modify the static branch to a jmp, the assembly become:

    jmp     DO_B  ; Modified!
    call    do_a  ; do_a()
DO_C:
    call    do_c  ; do_c()
    ret           ; End of this function
DO_B:
    call    do_b  ; do_b()
    jmp     DO_C  ; goto DO_C

Things go right!

Static Branch Modification Rule

Branch layout

As described in the static branch modification content, there are two branches to be executed: one can be executed after nop, and is adjacent to the main part; another shall be executed with two additional jmp, and its location is in the end of function. This difference will make a little impact on the performance. Usually, the branch that unlikely to be executed should be the latter one, and the other should be the former one. In this crate, the layout is controlled by static_branch_likely! and static_branch_unlikely!.

When using static_branch_likely!, the true branch will become a likely branch, which will be positioned near the main part, and can be just noped to it. The false branch is positioned in some other places, and is involved with two additional jmps.

In the inline assembly, the difference is represented by break 'my_label true or break 'my_label false in the end of block.

Initial instruction

After getting the right branch layout, then which instruction should be the initial instruction generated into the binary? It is used for the situation where, we do not update the static key, then its associated static branches need to take the correct path.

The rule is:

  • For static_branch_likely!

    • If static key is defined with initial value true, then generate nop.
    • If static key is defined with initial value false, then generate jmp.
  • For static_branch_unlikely!

    • If static key is defined with initial value false, then generate nop.
    • If static key is defined with initial value true, then generate jmp.

Modification direction

Another question is, when enabling/disabling a static key, what instruction should we update to? Should we update jmp to nop, or update nop to jmp? To solve this question, we shall use the initial status recorded in the last byte of static key address in key field of JumpEntry.

The initial status is a bool, which indicates whether the likely branch is true branch. As described above, the likely branch should always be adjacent to the main part. And this status is controlled by whether we use static_branch_likely! or static_branch_unlikely!, and the initial value of static key.

Then when modifying static branches, the modification direction can be determined by xoring the new value of static key, and the initial status recorded in jump entry. For example, if the likely branch is true branch, and the new value of static key is true, then we shall update jmp to nop, since we need to execute the block adjacent to the static branch check.

Static Branch Modification Approach

The last question need to be solved, is how to modify static branch.

As a ground knowledge, the instructions are in text section. In most platforms, the text section has executable protection, and is non-writable to avoid attackers to modify instructions to execute malicious logic. This kind of protection mechanism is called DEP (Data Execution Protection) or W^X (Writable Xor eXecutable).

In order to modify static branch instructions, we then need to bypass the DEP in a short moment. This may be dangerous and vulnerable, while the DEP bypassing only happens in the modification of static key. After modification done, the protection is restored. So pay attention to the modification!

FAQs

Why static key should be applied to seldomly changed features?

Two reasons:

  • The modification of static key will impose a potential security risk since we need to bypass DEP. The DEP will be restored after modification done.
  • The modification is less efficient since many syscalls are involved.

Why static keys must only be modified in a single-thread environment?

In userland, it is very complicated to modify an instruction which may be executed by another thread. Linux kernel community once proposed a text_poke syscall, but is still not available nowadays. BTW, Linus doesn't seem to like it, and his reasons do make sense.

Another reason is that we need to manipulate memory protection to bypass DEP, which may involves race condition on the protection itself in multi-thread environment. Mutex may be used to avoid data race, while if cargo resolves multi-version static-key crates dependencies, the mutexes would be duplicated for each version, and this approach is thus useless. This shall be resolved when RFC 1977: public & private dependencies is stabilized. rust-lang/cargo#2363 is also a reference.

Why is nightly Rust required?

We use inline assembly internally, where asm_goto and asm_const feature is required. As long as these two features are stablized, we can use stable Rust then.

Why static_branch_likely! and static_branch_unlikely! are implemented as macros?

Because when passing a static variable to inline assembly as sym argument, it requires the argument to be a static path. We cannot construct such path in a function.

What OS-specific features are required to extend to new OSs?

  • Symbols indicating a custom section's start and stop

    Used for sorting static keys section and iterating stop signs.

  • Approaches to keep custom section retained from linker's GC

  • Approaches to bypass DEP

    Used to update static branch instructions.

What arch-specific features are required to extend to new architectures?

  • A nop instruction with same length as jmp (or can divide the length, e.g. 2-byte nop and 4-byte jmp)
  • Approaches to clear instruction cache in Linux. (Should be added to Evian-Zhang/clear-cache)
  • Inline assembly supported by Rust

Can I use this crate in no_std?

Yes.