Sunday, May 18, 2014

A brief introduction to per-cpu variables


Per-cpu variables are widely used in the Linux kernel, for example in per-cpu counters and per-cpu caches. The advantage of per-cpu variables is obvious: since each CPU works on its own copy of the data, we do not need locks to synchronize with other CPUs, and avoiding locks improves performance.

There are two kinds of per-cpu variables: static and dynamic. Static variables are defined at build time. Linux provides the DEFINE_PER_CPU macro to define them.

#define DEFINE_PER_CPU(type, name)

static DEFINE_PER_CPU(struct delayed_work, vmstat_work);

Dynamic per-cpu variables are obtained at run time with the __alloc_percpu API. __alloc_percpu returns the per-cpu address of the variable.

void __percpu *__alloc_percpu(size_t size, size_t align)
s->cpu_slab = __alloc_percpu(sizeof(struct kmem_cache_cpu),2 * sizeof(void *));

One big difference between per-cpu variables and other variables is that we must use the per-cpu accessor macros to reach the real per-cpu variable for a given CPU. Accessing a per-cpu variable without going through these macros is a bug in Linux kernel programming. We will see the reason later.

Here are two examples of accessing per-cpu variables:

struct vm_event_state *this = &per_cpu(vm_event_states, cpu);

struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);

Let's take a closer look at the behaviour of Linux per-cpu variables. After we define our static per-cpu variables, the compiler places all of them in the per-cpu section (.data..percpu). We can see them with the 'readelf' or 'nm' tools:

0000000000000000 D __per_cpu_start
...
000000000000f1c0 d lru_add_drain_work
000000000000f1e0 D vm_event_states
000000000000f420 d vmstat_work
000000000000f4a0 d vmap_block_queue
000000000000f4c0 d vfree_deferred
000000000000f4f0 d memory_failure_cpu
...
0000000000013ac0 D __per_cpu_end

  [15] .vvar             PROGBITS         ffffffff81698000  00898000
       00000000000000f0  0000000000000000  WA       0     0     16
  [16] .data..percpu     PROGBITS         0000000000000000  00a00000
       0000000000013ac0  0000000000000000  WA       0     0     4096
  [17] .init.text        PROGBITS         ffffffff816ad000  00aad000
       000000000003fa21  0000000000000000  AX       0     0     16

You can see that our vmstat_work is at 0xf420, which lies between __per_cpu_start and __per_cpu_end. These two special symbols mark the start and end addresses of the per-cpu section.

One simple question: there is only one vmstat_work entry in the per-cpu section, but we should have NR_CPUS copies of it. Where are all the other vmstat_work entries?

Actually, the per-cpu section is just a roadmap of all per-cpu variables. The real body of every per-cpu variable is allocated in a per-cpu chunk at run time. Linux makes NR_CPUS copies of the static and dynamic variables. To get to those real bodies, we use the per_cpu or per_cpu_ptr macros.

What per_cpu and per_cpu_ptr do is add an offset (taken from __per_cpu_offset) to the given address to reach the real body of the per-cpu variable.

#define per_cpu(var, cpu) \
        (*SHIFT_PERCPU_PTR(&(var), per_cpu_offset(cpu)))

#define per_cpu_offset(x) (__per_cpu_offset[x])
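
To make the offset trick concrete, here is a minimal user-space sketch of the same idea. It is not kernel code: template_section, cpu_offset, setup_units and per_cpu_ptr_sketch are names I made up for illustration, and malloc stands in for the real chunk allocation. A single template copy plays the role of the per-cpu section, and each "CPU" gets its own copy that is reached by adding a per-CPU offset to the template address.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NR_CPUS 4       /* assumed CPU count, same as in the picture */

/* Stand-in for the per-cpu section: one template copy of each variable. */
struct percpu_template {
        long vm_events;         /* hypothetical per-cpu counter */
        int vmstat_pending;     /* hypothetical per-cpu flag */
};
static struct percpu_template template_section = { 0, 1 };

/* One offset per CPU, playing the role of __per_cpu_offset[]. */
static uintptr_t cpu_offset[NR_CPUS];

/*
 * Allocate one unit per CPU and copy the template into each unit.
 * The chunk is kept for the lifetime of the program.
 */
static int setup_units(void)
{
        size_t unit_size = sizeof(template_section);
        char *chunk = malloc(unit_size * NR_CPUS);

        if (!chunk)
                return -1;
        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                char *unit = chunk + cpu * unit_size;

                memcpy(unit, &template_section, unit_size);
                /* distance from the template to this CPU's copy */
                cpu_offset[cpu] = (uintptr_t)unit - (uintptr_t)&template_section;
        }
        return 0;
}

/* Translate an address inside the template into the copy owned by cpu. */
static void *per_cpu_ptr_sketch(void *template_addr, int cpu)
{
        assert(cpu >= 0 && cpu < NR_CPUS);
        return (void *)((uintptr_t)template_addr + cpu_offset[cpu]);
}
```

After setup_units() succeeds, per_cpu_ptr_sketch(&template_section.vm_events, 2) points at CPU 2's private counter. Writing through it leaves the template and the other CPUs' copies untouched, which is also why touching the template address directly would be a bug.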

It's easier to understand the idea with a picture:

Translating a per-cpu variable to its real body (NR_CPUS = 4)

Take a closer look:
There are three parts in a unit: static, reserved, and dynamic.
static: the static per-cpu variables (__per_cpu_end - __per_cpu_start)
reserved: per-cpu slots reserved for kernel modules
dynamic: slots for dynamic allocation (__alloc_percpu)

Unit and chunk

static struct pcpu_alloc_info * __init pcpu_build_alloc_info(
                                size_t reserved_size, size_t dyn_size,
                                size_t atom_size,
                                pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
{
        static int group_map[NR_CPUS] __initdata;
        static int group_cnt[NR_CPUS] __initdata;
        const size_t static_size = __per_cpu_end - __per_cpu_start;
+-- 12 lines: int nr_groups = 1, nr_units = 0;----------------------
        /* calculate size_sum and ensure dyn_size is enough for early alloc */
        size_sum = PFN_ALIGN(static_size + reserved_size +
                            max_t(size_t, dyn_size, PERCPU_DYNAMIC_EARLY_SIZE));
        dyn_size = size_sum - static_size - reserved_size;
+--108 lines: Determine min_unit_size, alloc_size and max_upa such that--
}
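
The size calculation in the excerpt above can be tried out in user space. The sketch below only mirrors the two visible statements (size_sum and the adjusted dyn_size). The 4K page size and the 12KB minimum for the early dynamic area are assumptions of mine, and unit_layout_sketch is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_SKETCH          4096UL        /* assumed 4K pages */
#define DYNAMIC_EARLY_SIZE_SKETCH (12UL << 10)  /* assumed minimum dynamic area */

/* Round x up to the next page boundary, like PFN_ALIGN(). */
static size_t pfn_align(size_t x)
{
        return (x + PAGE_SIZE_SKETCH - 1) & ~(PAGE_SIZE_SKETCH - 1);
}

struct unit_layout {
        size_t static_size;     /* __per_cpu_end - __per_cpu_start */
        size_t reserved_size;   /* slots reserved for modules */
        size_t dyn_size;        /* grown so that the unit ends on a page boundary */
        size_t size_sum;        /* total size of the three parts */
};

static struct unit_layout unit_layout_sketch(size_t static_size,
                                             size_t reserved_size,
                                             size_t dyn_size)
{
        struct unit_layout l;
        size_t dyn = dyn_size > DYNAMIC_EARLY_SIZE_SKETCH ?
                     dyn_size : DYNAMIC_EARLY_SIZE_SKETCH;

        l.static_size = static_size;
        l.reserved_size = reserved_size;
        l.size_sum = pfn_align(static_size + reserved_size + dyn);
        /* whatever the rounding added is handed to the dynamic part */
        l.dyn_size = l.size_sum - static_size - reserved_size;
        assert((l.size_sum & (PAGE_SIZE_SKETCH - 1)) == 0);
        return l;
}
```

With the static size from the readelf output (0x13ac0 bytes) and made-up inputs of 8KB reserved and 20KB dynamic, the sum is rounded up to 27 pages and the dynamic part grows from 20480 to 21824 bytes.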

After determining the size of the unit, the chunk is allocated by the memblock APIs.

int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
                                  size_t atom_size,
                                  pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
                                  pcpu_fc_alloc_fn_t alloc_fn,
                                  pcpu_fc_free_fn_t free_fn)
{
+-- 20 lines: void *base = (void *)ULONG_MAX;---------------------------------
        /* allocate, copy and determine base address */
        for (group = 0; group < ai->nr_groups; group++) {
                struct pcpu_group_info *gi = &ai->groups[group];
                unsigned int cpu = NR_CPUS;
                void *ptr;

                for (i = 0; i < gi->nr_units && cpu == NR_CPUS; i++)
                        cpu = gi->cpu_map[i];
                BUG_ON(cpu == NR_CPUS);

                /* allocate space for the whole group */
                ptr = alloc_fn(cpu, gi->nr_units * ai->unit_size, atom_size);
                if (!ptr) {
                        rc = -ENOMEM;
                        goto out_free_areas;
                }
                /* kmemleak tracks the percpu allocations separately */
                kmemleak_free(ptr);
                areas[group] = ptr;

                base = min(ptr, base);
        }
+-- 60 lines: Copy data and free unused parts.  This should happen after all---
}

static void * __init pcpu_dfl_fc_alloc(unsigned int cpu, size_t size,
                                       size_t align)
{
        return  memblock_virt_alloc_from_nopanic(
                        size, align, __pa(MAX_DMA_ADDRESS));
}

Tuesday, April 8, 2014

printk as a debug tool

printk is a natural and basic tool for debugging the kernel. Sometimes it is the only tool we have. Here are some tips for using printk.

1) printk formats

Documentation/printk-formats.txt introduces many useful printk formats. I use the %p family the most:


Raw pointer value SHOULD be printed with %p. The kernel supports
the following extended format specifiers for pointer types:

Symbols/Function Pointers:

        %pF     versatile_init+0x0/0x110
        %pf     versatile_init
        %pS     versatile_init+0x0/0x110
        %pSR    versatile_init+0x9/0x110
                (with __builtin_extract_return_addr() translation)
        %ps     versatile_init
        %pB     prev_fn_of_versatile_init+0x88/0x88

2) print_hex_dump

Sometimes I have to create memory dumps. You can use a simple for loop to do that, but the Linux kernel provides a better way: print_hex_dump.

Function prototype:
static inline void print_hex_dump(const char *level, const char *prefix_str,
                                  int prefix_type, int rowsize, int groupsize,
                                  const void *buf, size_t len, bool ascii)

For example:
                print_hex_dump(KERN_ALERT, "mem: ", DUMP_PREFIX_ADDRESS,
                                16, 1, p, 512, 1);

output:
mem: ddc86680: 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12  ................
mem: ddc86690: 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12  ................
mem: ddc866a0: 12 12 12 12 60 6b c8 dd 16 80 99 e0 fa 8e 2a c1  ....`k........*.
mem: ddc866b0: 16 80 99 e0 ce 92 2a c1 16 80 99 e0 f2 c1 1b c1  ......*.........
mem: ddc866c0: 16 80 99 e0 4c 8b 0a c1 4c 8b 0a c1 61 80 99 e0  ....L...L...a...
mem: ddc866d0: 16 80 99 e0 61 80 99 e0 16 80 99 e0 61 80 99 e0  ....a.......a...
mem: ddc866e0: 75 80 99 e0 48 01 00 c1 2b 36 05 c1 00 00 00 00  u...H...+6......
mem: ddc866f0: 4a 0c 00 00 99 ad 06 00 6d 35 05 c1 9e 8b 2a c1  J.......m5....*.
mem: ddc86700: 6d 35 05 c1 48 8c 2a c1 6d 35 05 c1 ee 89 0a c1  m5..H.*.m5......
mem: ddc86710: ee 89 0a c1 e4 0a 14 c1 e4 0a 14 c1 ee 89 0a c1  ................
mem: ddc86720: ee 89 0a c1 6d 35 05 c1 6d 35 05 c1 6d 35 05 c1  ....m5..m5..m5..
mem: ddc86730: a7 39 05 c1 ef b8 2a c1 00 00 00 00 00 00 00 00  .9....*.........
mem: ddc86740: 4a 0c 00 00 97 ad 06 00 5a 5a 5a 5a 5a 5a 5a 5a  J.......ZZZZZZZZ
mem: ddc86750: 14 dc 46 dd 14 dc 46 dd 00 00 00 00 6b 6b 6b 6b  ..F...F.....kkkk
mem: ddc86760: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.
mem: ddc86770: cc cc cc cc c0 69 c8 dd a0 83 20 c1 fa 8e 2a c1  .....i.... ...*.
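
If you want to play with this row layout outside the kernel, here is a small user-space sketch that produces lines in the same shape as the output above (16 bytes per row, one-byte groups, ASCII column). It is only an imitation: format_row and hex_dump_sketch are my own names, and it prints the offset into the buffer instead of the address, so the output is the same on every run.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define ROWSIZE 16      /* bytes per output line, as in the call above */

/*
 * Format up to ROWSIZE bytes as "xx xx ...  ascii" into out.
 * Short rows are padded with blanks so the ASCII column stays aligned.
 */
static void format_row(const unsigned char *p, size_t n, char *out, size_t outlen)
{
        static const char hex[] = "0123456789abcdef";
        size_t ascii_col = ROWSIZE * 3 + 1;     /* hex area plus one extra blank */

        assert(n <= ROWSIZE && outlen >= ROWSIZE * 4 + 2);
        memset(out, ' ', ascii_col);
        for (size_t i = 0; i < n; i++) {
                out[i * 3] = hex[p[i] >> 4];
                out[i * 3 + 1] = hex[p[i] & 0xf];
                /* non-printable bytes are shown as '.' */
                out[ascii_col + i] = isprint(p[i]) ? (char)p[i] : '.';
        }
        out[ascii_col + n] = '\0';
}

/* Dump len bytes of buf, one row per line, each prefixed by its offset. */
static void hex_dump_sketch(const char *prefix, const void *buf, size_t len)
{
        const unsigned char *p = buf;
        char line[ROWSIZE * 4 + 2];

        for (size_t off = 0; off < len; off += ROWSIZE) {
                size_t n = len - off < ROWSIZE ? len - off : ROWSIZE;

                format_row(p + off, n, line, sizeof(line));
                printf("%s%08zx: %s\n", prefix, off, line);
        }
}
```

Calling hex_dump_sketch("mem: ", p, 512) on a user-space buffer gives rows that can be compared side by side with a kernel dump.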

3) pr_alert family

The kernel provides wrapper macros for the different printk levels. I prefer the macros because they are easier to read and take fewer characters to type.
#define pr_emerg(fmt, ...) \
        printk(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
#define pr_alert(fmt, ...) \
        printk(KERN_ALERT pr_fmt(fmt), ##__VA_ARGS__)
#define pr_crit(fmt, ...) \
        printk(KERN_CRIT pr_fmt(fmt), ##__VA_ARGS__)
#define pr_err(fmt, ...) \
        printk(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
#define pr_warning(fmt, ...) \
        printk(KERN_WARNING pr_fmt(fmt), ##__VA_ARGS__)
#define pr_warn pr_warning
#define pr_notice(fmt, ...) \
        printk(KERN_NOTICE pr_fmt(fmt), ##__VA_ARGS__)
#define pr_info(fmt, ...) \
        printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
#define pr_cont(fmt, ...) \
        printk(KERN_CONT fmt, ##__VA_ARGS__)
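
One nice side effect of these wrappers is pr_fmt: if it is defined before the wrappers, every message gets a common prefix for free. The user-space sketch below imitates the pattern with snprintf so it can be tried without a kernel. The "<1>" and "<6>" strings are only stand-ins for KERN_ALERT and KERN_INFO, the *_sketch macros and has_level are names I invented, and ##__VA_ARGS__ is the same GNU extension the kernel macros rely on.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Every message built through the wrappers gets this prefix. */
#define pr_fmt(fmt) "mymodule: " fmt

#define LEVEL_ALERT "<1>"       /* stand-in for KERN_ALERT */
#define LEVEL_INFO  "<6>"       /* stand-in for KERN_INFO */

/* User-space stand-ins: format into a buffer instead of the kernel log. */
#define pr_alert_sketch(buf, len, fmt, ...) \
        snprintf(buf, len, LEVEL_ALERT pr_fmt(fmt), ##__VA_ARGS__)
#define pr_info_sketch(buf, len, fmt, ...) \
        snprintf(buf, len, LEVEL_INFO pr_fmt(fmt), ##__VA_ARGS__)

/* Return non-zero if line starts with the given level marker. */
static int has_level(const char *line, const char *level)
{
        assert(line && level);
        return strncmp(line, level, strlen(level)) == 0;
}
```

For example, pr_alert_sketch(buf, sizeof(buf), "cannot allocate %zu bytes\n", (size_t)32) leaves "<1>mymodule: cannot allocate 32 bytes\n" in buf. The level, the prefix and the format string are glued together at compile time by string literal concatenation.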

Tuesday, February 25, 2014

rxvt-unicode + tmux do italic instead of reverse

When using rxvt-unicode with tmux, I found that rxvt-unicode shows italic text instead of reverse colors when I search.

Italic

Reverse
I followed the setting in the article below:

http://sourceforge.net/mailarchive/forum.php?thread_name=20110812111030.GH13508%40plenz.com&forum_name=tmux-users

I modified ~/.tmux.conf and added the following line, which solved my problem.

set -g terminal-overrides 'rxvt-unicode*:sitm@'

Friday, February 21, 2014

rxvt-unicode - not to select trailing blanks in vim

Every time I select text with the mouse in vim (running in rxvt-unicode) and paste it somewhere else, the selection includes all the trailing blanks on every line. It's a very annoying problem.

Finally I found the solution in

http://www.reddit.com/r/emacs/comments/1ox5pf/why_fill_the_empty_space_with_spaces/

I followed the article and added the following options to my ~/.Xdefaults:

URxvt.perl-ext-common: default,selection-autotransform
URxvt.selection-autotransform.0: s/ +$//gm

It removes the trailing blanks with a Perl regular expression, and everything works as I want now. Thanks for this article!

Thursday, February 20, 2014

debug with Linux slub allocator


The slub allocator in Linux has useful debug features, such as poisoning, redzone checking, and allocate/free traces with timestamps. They are very useful during the product development stage. Let's create a kernel module and test the debug features.

Make sure the slub allocator and its debug support are built into your kernel.

CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y

The slub allocator keeps additional metadata to store allocate/free traces and timestamps. Every time the slub allocator allocates or frees an object, it performs a poison check (on the data area) and a redzone check (on the boundary).

The following module shows how this works. It allocates 32 bytes from the kernel and then overwrites the redzone by calling memset with 36 bytes.

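#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/string.h>

void try_to_corrupt_redzone(void)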
{
        void *p = kmalloc(32, GFP_KERNEL);
        if (p) {
                pr_alert("p: 0x%p\n", p);
                memset(p, 0x12, 36);    /* write too much */
                print_hex_dump(KERN_ALERT, "mem: ", DUMP_PREFIX_ADDRESS,
                                16, 1, p, 512, 1);
                kfree(p);       /* slub.c should catch this error */
        }
}

static int mymodule_init(void)
{
        pr_alert("%s init\n", __FUNCTION__);
        try_to_corrupt_redzone();
        return 0;
}

static void mymodule_exit(void)
{
        pr_alert("%s exit\n", __FUNCTION__);
}

module_init(mymodule_init);
module_exit(mymodule_exit);

After the object is freed, the kernel checks it, finds that the redzone has been overwritten, and reports:

[ 2050.630002] mymodule_init init
[ 2050.630565] p: 0xddc86680
[ 2050.630653] mem: ddc86680: 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12  ................
[ 2050.630779] mem: ddc86690: 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12  ................
[ 2050.630897] mem: ddc866a0: 12 12 12 12 60 6b c8 dd 16 80 99 e0 fa 8e 2a c1  ....`k........*.
[ 2050.631014] mem: ddc866b0: 16 80 99 e0 ce 92 2a c1 16 80 99 e0 f2 c1 1b c1  ......*.........
[ 2050.631130] mem: ddc866c0: 16 80 99 e0 4c 8b 0a c1 4c 8b 0a c1 61 80 99 e0  ....L...L...a...
[ 2050.631248] mem: ddc866d0: 16 80 99 e0 61 80 99 e0 16 80 99 e0 61 80 99 e0  ....a.......a...
[ 2050.631365] mem: ddc866e0: 75 80 99 e0 48 01 00 c1 2b 36 05 c1 00 00 00 00  u...H...+6......
[ 2050.631483] mem: ddc866f0: 4a 0c 00 00 99 ad 06 00 6d 35 05 c1 9e 8b 2a c1  J.......m5....*.
[ 2050.631599] mem: ddc86700: 6d 35 05 c1 48 8c 2a c1 6d 35 05 c1 ee 89 0a c1  m5..H.*.m5......
[ 2050.631716] mem: ddc86710: ee 89 0a c1 e4 0a 14 c1 e4 0a 14 c1 ee 89 0a c1  ................
[ 2050.631832] mem: ddc86720: ee 89 0a c1 6d 35 05 c1 6d 35 05 c1 6d 35 05 c1  ....m5..m5..m5..
[ 2050.631948] mem: ddc86730: a7 39 05 c1 ef b8 2a c1 00 00 00 00 00 00 00 00  .9....*.........
[ 2050.633948] mem: ddc86740: 4a 0c 00 00 97 ad 06 00 5a 5a 5a 5a 5a 5a 5a 5a  J.......ZZZZZZZZ
[ 2050.634095] mem: ddc86750: 14 dc 46 dd 14 dc 46 dd 00 00 00 00 6b 6b 6b 6b  ..F...F.....kkkk
[ 2050.634236] mem: ddc86760: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.
[ 2050.634378] mem: ddc86770: cc cc cc cc c0 69 c8 dd a0 83 20 c1 fa 8e 2a c1  .....i.... ...*.
[ 2050.634629] =============================================================================
[ 2050.634750] BUG kmalloc-32 (Tainted: P    B      O): Redzone overwritten
[ 2050.634828] -----------------------------------------------------------------------------
[ 2050.634828] 
[ 2050.634967] INFO: 0xddc866a0-0xddc866a3. First byte 0x12 instead of 0xcc
[ 2050.635123] INFO: Allocated in try_to_corrupt_redzone+0x16/0x61 [mymodule] age=1 cpu=0 pid=3146
[ 2050.635255]  alloc_debug_processing+0x63/0xd1
[ 2050.635337]  try_to_corrupt_redzone+0x16/0x61 [mymodule]
[ 2050.635423]  __slab_alloc.constprop.73+0x366/0x384
[ 2050.635506]  try_to_corrupt_redzone+0x16/0x61 [mymodule]
[ 2050.635594]  vt_console_print+0x21e/0x226
[ 2050.635672]  try_to_corrupt_redzone+0x16/0x61 [mymodule]
[ 2050.635758]  kmem_cache_alloc_trace+0x43/0xd7
[ 2050.635832]  kmem_cache_alloc_trace+0x43/0xd7
[ 2050.635909]  mymodule_init+0x0/0x19 [mymodule]
[ 2050.635992]  try_to_corrupt_redzone+0x16/0x61 [mymodule]
[ 2050.636003]  mymodule_init+0x0/0x19 [mymodule]
[ 2050.636092]  try_to_corrupt_redzone+0x16/0x61 [mymodule]
[ 2050.636179]  mymodule_init+0x0/0x19 [mymodule]
[ 2050.636261]  mymodule_init+0x14/0x19 [mymodule]
[ 2050.636343]  do_one_initcall+0x6c/0xf4
[ 2050.636428]  load_module+0x1690/0x199a
[ 2050.636508] INFO: Freed in load_module+0x15d2/0x199a age=3 cpu=0 pid=3146
[ 2050.636598]  free_debug_processing+0xd6/0x142
[ 2050.636676]  load_module+0x15d2/0x199a
[ 2050.636749]  __slab_free+0x3e/0x28d
[ 2050.636819]  load_module+0x15d2/0x199a
[ 2050.636888]  kfree+0xe4/0x102
[ 2050.636953]  kfree+0xe4/0x102
[ 2050.637020]  kobject_uevent_env+0x361/0x39a
[ 2050.637091]  kobject_uevent_env+0x361/0x39a
[ 2050.637163]  kfree+0xe4/0x102
[ 2050.637227]  kfree+0xe4/0x102
[ 2050.637294]  load_module+0x15d2/0x199a
[ 2050.637366]  load_module+0x15d2/0x199a
[ 2050.637438]  load_module+0x15d2/0x199a
[ 2050.637509]  SyS_init_module+0x72/0x8a
[ 2050.637581]  syscall_call+0x7/0xb
[ 2050.637649] INFO: Slab 0xdffa90c0 objects=19 used=8 fp=0xddc86000 flags=0x40000080
[ 2050.637749] INFO: Object 0xddc86680 @offset=1664 fp=0xddc86b60
[ 2050.637749] 
[ 2050.637875] Bytes b4 ddc86670: 14 01 00 00 95 ad 06 00 5a 5a 5a 5a 5a 5a 5a 5a  ........ZZZZZZZZ
[ 2050.637875] Object ddc86680: 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12  ................
[ 2050.637875] Object ddc86690: 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12  ................
[ 2050.637875] Redzone ddc866a0: 12 12 12 12                                      ....
[ 2050.637875] Padding ddc86748: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ
[ 2050.637875] CPU: 0 PID: 3146 Comm: insmod Tainted: P    B      O 3.10.17 #1
[ 2050.637875] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 2050.637875]  00000000 c10a7b59 c10941c5 dffa90c0 ddc86680 de8012cc de801280 ddc86680
[ 2050.637875]  dffa90c0 c10a7bd3 c13689a5 ddc866a0 000000cc 00000004 de801280 ddc86680
[ 2050.637875]  dffa90c0 de800e00 c12a8b2f 000000cc ddc86680 de801280 dffa90c0 dd407e50
[ 2050.637875] Call Trace:
[ 2050.637875]  [<c10a7b59>] ? check_bytes_and_report+0x6d/0xb0
[ 2050.637875]  [<c10941c5>] ? page_address+0x1a/0x79
[ 2050.637875]  [<c10a7bd3>] ? check_object+0x37/0x149
[ 2050.637875]  [<c12a8b2f>] ? free_debug_processing+0x67/0x142
[ 2050.637875]  [<c12a8c48>] ? __slab_free+0x3e/0x28d
[ 2050.637875]  [<e0998075>] ? mymodule_init+0x14/0x19 [mymodule]
[ 2050.637875]  [<c102063d>] ? wake_up_klogd+0x1d/0x1e
[ 2050.637875]  [<c10a89ee>] ? kfree+0xe4/0x102
[ 2050.637875]  [<c10a89ee>] ? kfree+0xe4/0x102
[ 2050.637875]  [<e0998075>] ? mymodule_init+0x14/0x19 [mymodule]
[ 2050.637875]  [<e0998075>] ? mymodule_init+0x14/0x19 [mymodule]
[ 2050.637875]  [<e0998061>] ? try_to_corrupt_redzone+0x61/0x61 [mymodule]
[ 2050.637875]  [<e0998075>] ? mymodule_init+0x14/0x19 [mymodule]
[ 2050.637875]  [<c1000148>] ? do_one_initcall+0x6c/0xf4
[ 2050.637875]  [<c105362b>] ? load_module+0x1690/0x199a
[ 2050.637875]  [<c10539a7>] ? SyS_init_module+0x72/0x8a
[ 2050.637875]  [<c12ab8ef>] ? syscall_call+0x7/0xb
[ 2050.637875] FIX kmalloc-32: Restoring 0xddc866a0-0xddc866a3=0xcc
[ 2050.637875] 
[ 2051.232817] mymodule_exit exit

First, the slub allocator prints the error type, "Redzone overwritten":
[ 2050.634629] =============================================================================
[ 2050.634750] BUG kmalloc-32 (Tainted: P    B      O): Redzone overwritten
[ 2050.634828] -----------------------------------------------------------------------------
[ 2050.634828] 
[ 2050.634967] INFO: 0xddc866a0-0xddc866a3. First byte 0x12 instead of 0xcc

To understand what a redzone is, take a look at the memory content around the object:

[ 2050.637875] Bytes b4 ddc86670: 14 01 00 00 95 ad 06 00 5a 5a 5a 5a 5a 5a 5a 5a  ........ZZZZZZZZ
[ 2050.637875] Object ddc86680: 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12  ................
[ 2050.637875] Object ddc86690: 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12  ................
[ 2050.637875] Redzone ddc866a0: 12 12 12 12                                      ....
[ 2050.637875] Padding ddc86748: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ

We write 36 bytes of 0x12 from the start of the object: the first 32 bytes fill the 32-byte object (0xddc86680 - 0xddc8669f) and the 4 extra bytes land in the redzone (normally 0xbb or 0xcc). When the object is returned to the kernel, the kernel finds that the redzone is neither 0xcc nor 0xbb and reports this as a BUG.
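
The check itself is easy to imitate in user space. The sketch below is not how slub.c is written. It only shows the idea: put a few marker bytes right behind the object at allocation time and verify them when the object is freed. The 0xcc value comes from the log above, while the 4-byte redzone width and the rz_* names are assumptions of mine.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define RED_ACTIVE   0xcc       /* redzone byte of a live object, as in the log */
#define REDZONE_SIZE 4          /* assumed redzone width for this sketch */

/* Allocate size bytes followed by a redzone filled with RED_ACTIVE. */
static void *rz_alloc(size_t size)
{
        unsigned char *p = malloc(size + REDZONE_SIZE);

        if (p)
                memset(p + size, RED_ACTIVE, REDZONE_SIZE);
        return p;
}

/*
 * Count the redzone bytes that no longer hold RED_ACTIVE.
 * 0 means the object was not overrun.
 */
static size_t rz_check(const void *obj, size_t size)
{
        const unsigned char *rz = (const unsigned char *)obj + size;
        size_t bad = 0;

        for (size_t i = 0; i < REDZONE_SIZE; i++)
                if (rz[i] != RED_ACTIVE)
                        bad++;
        return bad;
}

/* Check the redzone, then free. Returns the number of corrupted bytes. */
static size_t rz_free(void *obj, size_t size)
{
        size_t bad;

        assert(obj != NULL);
        bad = rz_check(obj, size);
        free(obj);
        return bad;
}
```

Allocating 32 bytes with rz_alloc(32) and then doing memset(p, 0x12, 36), as the module does, makes rz_free(p, 32) report 4 corrupted bytes. That matches the 0xddc866a0-0xddc866a3 range in the kernel report.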

The slub allocator also reports the latest allocate/free history of this object. You can see that the object was just allocated by our kernel module function 'try_to_corrupt_redzone'.

Sometimes the traces of the object are more useful than the function backtrace. For example, consider a use-after-free case: function A frees an object but keeps writing to it, and the same object is then allocated by another function B. In this case, function B ends up with a corrupted object. If we have the free trace of this object, we can trace back to its previous owner, function A.

[ 2050.635123] INFO: Allocated in try_to_corrupt_redzone+0x16/0x61 [mymodule] age=1 cpu=0 pid=3146
[ 2050.635255]  alloc_debug_processing+0x63/0xd1
[ 2050.635337]  try_to_corrupt_redzone+0x16/0x61 [mymodule]
[ 2050.635423]  __slab_alloc.constprop.73+0x366/0x384
[ 2050.635506]  try_to_corrupt_redzone+0x16/0x61 [mymodule]
[ 2050.635594]  vt_console_print+0x21e/0x226
[ 2050.635672]  try_to_corrupt_redzone+0x16/0x61 [mymodule]
[ 2050.635758]  kmem_cache_alloc_trace+0x43/0xd7
[ 2050.635832]  kmem_cache_alloc_trace+0x43/0xd7
[ 2050.635909]  mymodule_init+0x0/0x19 [mymodule]
[ 2050.635992]  try_to_corrupt_redzone+0x16/0x61 [mymodule]
[ 2050.636003]  mymodule_init+0x0/0x19 [mymodule]
[ 2050.636092]  try_to_corrupt_redzone+0x16/0x61 [mymodule]
[ 2050.636179]  mymodule_init+0x0/0x19 [mymodule]
[ 2050.636261]  mymodule_init+0x14/0x19 [mymodule]
[ 2050.636343]  do_one_initcall+0x6c/0xf4
[ 2050.636428]  load_module+0x1690/0x199a
[ 2050.636508] INFO: Freed in load_module+0x15d2/0x199a age=3 cpu=0 pid=3146
[ 2050.636598]  free_debug_processing+0xd6/0x142
[ 2050.636676]  load_module+0x15d2/0x199a
[ 2050.636749]  __slab_free+0x3e/0x28d
[ 2050.636819]  load_module+0x15d2/0x199a
[ 2050.636888]  kfree+0xe4/0x102
[ 2050.636953]  kfree+0xe4/0x102
[ 2050.637020]  kobject_uevent_env+0x361/0x39a
[ 2050.637091]  kobject_uevent_env+0x361/0x39a
[ 2050.637163]  kfree+0xe4/0x102
[ 2050.637227]  kfree+0xe4/0x102
[ 2050.637294]  load_module+0x15d2/0x199a
[ 2050.637366]  load_module+0x15d2/0x199a
[ 2050.637438]  load_module+0x15d2/0x199a
[ 2050.637509]  SyS_init_module+0x72/0x8a

Saturday, February 15, 2014

ARM64 Linux kernel virtual address space


Now let's talk about the Linux kernel virtual address space on 64-bit ARM CPUs. You can find information about ARMv8 on ARM's official website: http://www.arm.com/products/processors/armv8-architecture.php

One big problem on 32-bit CPUs is the 4GB limit on the virtual address space. The problem remains even with PAE support, since PAE extends the physical address space, not the virtual address space. Things changed with the arrival of 64-bit CPUs such as AMD64 and ARMv8: they can now support up to 2^64 addresses, which is, uhh.. a very big number.
Actually, 2^64 is too large, so the Linux kernel implementation uses only part of the 64 bits (42 bits with CONFIG_ARM64_64K_PAGES, 39 bits with 4K pages). This article assumes 4K pages (the VA_BITS = 39 case).

#ifdef CONFIG_ARM64_64K_PAGES
#define VA_BITS                 (42)
#else
#define VA_BITS                 (39)
#endif

One good thing about ARM64 is that, since we have enough virtual address bits, user space and kernel space can each have their own 2^39 = 512GB of virtual addresses!
All user virtual addresses have 25 leading zeros and all kernel addresses have 25 leading ones. Addresses between user space and kernel space are not used for mappings, so any access to them traps as an illegal access.
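
The "25 leading zeros or 25 leading ones" rule is simple enough to check with a few lines of C. This is just a user-space sketch of the rule as stated above (VA_BITS = 39). classify_va and the enum are names I made up.

```c
#include <assert.h>
#include <stdint.h>

#define VA_BITS 39      /* 4K pages, as assumed in this article */

enum va_kind { VA_USER, VA_KERNEL, VA_HOLE };

/* Classify a 64-bit virtual address by its upper 64 - VA_BITS bits. */
static enum va_kind classify_va(uint64_t addr)
{
        uint64_t top, all_ones;

        assert(VA_BITS > 0 && VA_BITS < 64);
        top = addr >> VA_BITS;
        all_ones = (UINT64_C(1) << (64 - VA_BITS)) - 1;

        if (top == 0)
                return VA_USER;         /* 25 leading zeros */
        if (top == all_ones)
                return VA_KERNEL;       /* 25 leading ones */
        return VA_HOLE;                 /* anything in between is unused */
}
```

So 0x0000007fffffffff is the last user address and 0xffffff8000000000 is the first kernel address. Everything between the two falls into the unused hole.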

ARM64 Linux virtual address space layout

kernel space:

Although we have no ARM64 environment right now, we can analyze the kernel virtual address space by reading the source code and observing a running AMD64 Linux box.

In arch/arm64/include/asm/memory.h, we can see some differences. First, there is no separate lowmem zone: the virtual address space is so big that we can treat all memory as lowmem and do not have to worry about running out of virtual addresses. (Yes, there is still a limit on kernel virtual addresses.) Second, the order of the different kernel virtual address regions changes:


#ifdef CONFIG_ARM64_64K_PAGES
#define VA_BITS                 (42)
#else                               
#define VA_BITS                 (39)
#endif                              
#define PAGE_OFFSET             (UL(0xffffffffffffffff) << (VA_BITS - 1))
#define MODULES_END             (PAGE_OFFSET)
#define MODULES_VADDR           (MODULES_END - SZ_64M)
#define EARLYCON_IOBASE         (MODULES_VADDR - SZ_4M)
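
To see the actual numbers behind these macros, the arithmetic can be redone in user space. The sketch below simply evaluates the four definitions above for a given VA_BITS. The struct and the function name are mine.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4M   UINT64_C(0x00400000)
#define SZ_64M  UINT64_C(0x04000000)

struct arm64_layout {
        uint64_t page_offset;
        uint64_t modules_end;
        uint64_t modules_vaddr;
        uint64_t earlycon_iobase;
};

/* Evaluate the memory.h macros quoted above for the given VA_BITS. */
static struct arm64_layout arm64_layout_sketch(unsigned int va_bits)
{
        struct arm64_layout l;

        assert(va_bits >= 1 && va_bits <= 64);
        l.page_offset = UINT64_MAX << (va_bits - 1);
        l.modules_end = l.page_offset;
        l.modules_vaddr = l.modules_end - SZ_64M;
        l.earlycon_iobase = l.modules_vaddr - SZ_4M;
        return l;
}
```

For VA_BITS = 39 this gives PAGE_OFFSET = 0xffffffc000000000, MODULES_VADDR = 0xffffffbffc000000 and EARLYCON_IOBASE = 0xffffffbffbc00000. Because the shift is VA_BITS - 1, PAGE_OFFSET sits halfway into the 512GB kernel range, with the module area placed right below it.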


         pr_notice("Virtual kernel memory layout:\n"                             
                   "    vmalloc : 0x%16lx - 0x%16lx   (%6ld MB)\n"
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
                   "    vmemmap : 0x%16lx - 0x%16lx   (%6ld MB)\n"
 #endif
                   "    modules : 0x%16lx - 0x%16lx   (%6ld MB)\n"
                   "    memory  : 0x%16lx - 0x%16lx   (%6ld MB)\n"
                   "      .init : 0x%p" " - 0x%p" "   (%6ld kB)\n"
                   "      .text : 0x%p" " - 0x%p" "   (%6ld kB)\n"
                   "      .data : 0x%p" " - 0x%p" "   (%6ld kB)\n",
                   MLM(VMALLOC_START, VMALLOC_END),
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
                   MLM((unsigned long)virt_to_page(PAGE_OFFSET),
                       (unsigned long)virt_to_page(high_memory)),
 #endif
                   MLM(MODULES_VADDR, MODULES_END),
                   MLM(PAGE_OFFSET, (unsigned long)high_memory),

                   MLK_ROUNDUP(__init_begin, __init_end),
                   MLK_ROUNDUP(_text, _etext),
                   MLK_ROUNDUP(_sdata, _edata));

see also:
arch/arm64/mm/init.c
arch/arm64/include/asm/pgtable.h

You can see that there is no pkmap or fixmap. This is because the kernel assumes that all memory has a valid kernel virtual address, so there is no need to create pkmap/fixmap mappings.

ARM64 kernel virtual address space layout


User space:
The implementation of the user virtual address space layout looks like the one on ARM32. Since the available user virtual address space grows to 512GB, we can build larger applications on 64-bit CPUs.

One interesting topic is that ARM claims ARMv8 is compatible with 32-bit ARM applications: all 32-bit applications can run on ARMv8 without modification. What does the virtual memory layout of a 32-bit application look like on a 64-bit kernel?
Actually, every process on a 64-bit kernel is a 64-bit process. To run a 32-bit ARM application, the Linux kernel still creates the process from the 64-bit init process, but limits its user address space to 4GB. In this way, we can have both 32-bit and 64-bit applications on a 64-bit Linux kernel.


 #ifdef CONFIG_COMPAT
 #define TASK_SIZE_32            UL(0x100000000)
 #define TASK_SIZE               (test_thread_flag(TIF_32BIT) ? \
                                 TASK_SIZE_32 : TASK_SIZE_64)
 #else
 #define TASK_SIZE               TASK_SIZE_64
 #endif /* CONFIG_COMPAT */

64-bit ARM applications on 64-bit Linux kernel

ARM64 64-bit user space program virtual address space layout


32-bit ARM applications on 64-bit Linux kernel

ARM64 32-bit user space program virtual address space layout

Note that a 32-bit application still has a 512GB kernel virtual address space and does not share its own 4GB of virtual address space with the kernel, so the user application has a complete 4GB of virtual addresses. On the other hand, 32-bit applications on a 32-bit kernel have only 3GB of virtual address space.


                                          ARM32 Linux   ARM64 Linux
32-bit user virtual address space size    3GB           4GB
64-bit user virtual address space size    N/A           512GB
kernel virtual address space              1GB           512GB

ARM32 Linux kernel virtual address space

A 32-bit ARM CPU can address up to 2^32 = 4GB*. That is not big enough nowadays, since the size of available DRAM on computing devices is growing fast and the memory usage of applications is growing as well.

In the Linux kernel implementation, user space and the kernel must coexist in the same 4GB virtual address space. This means that user space and the kernel each get less than 4GB of virtual address space.
The Linux kernel provides 3 different splits of the virtual address space: VMSPLIT_3G, VMSPLIT_2G, and VMSPLIT_1G.


Linux virtual address space options


The default configuration is VMSPLIT_3G. As you can see, kernel space runs from 0xC0000000 to 0xFFFFFFFF and user space runs from 0x00000000 up to 0xC0000000.
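
The three options differ only in where the split point sits. The sketch below works out the sizes on each side of it. The split addresses follow the option names (1G, 2G or 3G for user space). The enum and helper names are mine, and the user-side size is simply the split point, so any area the kernel may carve out just below it is ignored.

```c
#include <assert.h>
#include <stdint.h>

enum vmsplit { VMSPLIT_1G, VMSPLIT_2G, VMSPLIT_3G };

/* Address where kernel space starts for each split option. */
static uint32_t split_point(enum vmsplit split)
{
        switch (split) {
        case VMSPLIT_1G: return 0x40000000u;
        case VMSPLIT_2G: return 0x80000000u;
        case VMSPLIT_3G: return 0xC0000000u;
        }
        assert(!"unknown split");
        return 0;
}

/* Bytes of virtual address space above the split point. */
static uint64_t kernel_space_bytes(enum vmsplit split)
{
        return (UINT64_C(1) << 32) - split_point(split);
}

/* Bytes of virtual address space below the split point. */
static uint64_t user_space_bytes(enum vmsplit split)
{
        return split_point(split);
}
```

With VMSPLIT_3G the kernel gets 1GB and user space 3GB. VMSPLIT_1G turns that around and gives the kernel 3GB.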

Let's take a closer look at the VMSPLIT_3G mapping:

kernel space

We can observe the kernel virtual address layout by checking the boot log (dmesg) or by taking a look at arch/arm/mm/init.c.

lowmem: The memory that has a 1-to-1 mapping between virtual and physical addresses. Both the virtual and the physical addresses are contiguous, and this property makes virtual-to-physical address translation very easy: given a virtual address from lowmem, we can find its physical address by applying a constant offset (see __pa() and __va()).

vmalloc: The vmalloc memory is only virtually contiguous.

fixmap/pkmap: create fast mappings of a single page for the kernel. Mostly used in file systems.

modules: The virtual address range for loading and executing modules. Kernel modules are loaded into this part of virtual memory.
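
The lowmem translation mentioned in the list above is plain arithmetic, which the following user-space sketch illustrates. The DRAM start address (0x80000000) and the 768MB lowmem size are hypothetical values picked for the example, and pa_sketch/va_sketch are stand-ins for __pa() and __va(), not the kernel macros themselves.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET_SKETCH 0xC0000000u  /* start of lowmem with VMSPLIT_3G */
#define PHYS_OFFSET_SKETCH 0x80000000u  /* hypothetical start of DRAM */
#define LOWMEM_SIZE_SKETCH 0x30000000u  /* hypothetical 768MB of lowmem */

/* Virtual to physical: only valid for addresses inside lowmem. */
static uint32_t pa_sketch(uint32_t vaddr)
{
        assert(vaddr >= PAGE_OFFSET_SKETCH &&
               vaddr - PAGE_OFFSET_SKETCH < LOWMEM_SIZE_SKETCH);
        return vaddr - PAGE_OFFSET_SKETCH + PHYS_OFFSET_SKETCH;
}

/* Physical to virtual: the same constant offset, applied the other way. */
static uint32_t va_sketch(uint32_t paddr)
{
        assert(paddr >= PHYS_OFFSET_SKETCH &&
               paddr - PHYS_OFFSET_SKETCH < LOWMEM_SIZE_SKETCH);
        return paddr - PHYS_OFFSET_SKETCH + PAGE_OFFSET_SKETCH;
}
```

Because the offset is constant, two lowmem addresses that are 0x1000 apart virtually are also 0x1000 apart physically. This is exactly the property that vmalloc memory does not have.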

user space

The code that determines the user space virtual address layout is in arch/arm/mm/mmap.c.
The user space has two different kinds of mmap layout: legacy and non-legacy. The legacy layout sets the mmap base to TASK_UNMAPPED_BASE and mmap grows bottom-up; the non-legacy layout sets the mmap base to TASK_SIZE - 128MB, with some random shift for security reasons, and mmap grows top-down.


void arch_pick_mmap_layout(struct mm_struct *mm)
{
        unsigned long random_factor = 0UL;

        /* 8 bits of randomness in 20 address space bits */
        if ((current->flags & PF_RANDOMIZE) &&
            !(current->personality & ADDR_NO_RANDOMIZE))
                random_factor = (get_random_int() % (1 << 8)) << PAGE_SHIFT;
        if (mmap_is_legacy()) {
                mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
                mm->get_unmapped_area = arch_get_unmapped_area;
        } else {
                mm->mmap_base = mmap_base(random_factor);
                mm->get_unmapped_area = arch_get_unmapped_area_topdown;
        }
}
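
To get a feeling for the numbers, here is a user-space sketch of the random shift and of the non-legacy base as described above (TASK_SIZE - 128MB minus the shift). It is a simplification under my own assumptions: the TASK_SIZE value is assumed, the gap is fixed at 128MB, and the *_sketch names are not kernel functions.

```c
#include <assert.h>

#define PAGE_SHIFT_SKETCH 12
#define PAGE_SIZE_SKETCH  (1UL << PAGE_SHIFT_SKETCH)
#define TASK_SIZE_SKETCH  0xBF000000UL          /* assumed user limit for VMSPLIT_3G */
#define GAP_SKETCH        (128UL * 1024 * 1024) /* the 128MB mentioned above */

/* 8 bits of randomness, shifted to page granularity. */
static unsigned long random_factor_sketch(unsigned int rnd)
{
        return (unsigned long)(rnd % (1u << 8)) << PAGE_SHIFT_SKETCH;
}

/* Round x up to the next page boundary. */
static unsigned long page_align(unsigned long x)
{
        return (x + PAGE_SIZE_SKETCH - 1) & ~(PAGE_SIZE_SKETCH - 1);
}

/* Non-legacy (top-down) mmap base: below the gap, shifted down by the random factor. */
static unsigned long mmap_base_sketch(unsigned long random_factor)
{
        assert(random_factor <= (255UL << PAGE_SHIFT_SKETCH));
        return page_align(TASK_SIZE_SKETCH - GAP_SKETCH - random_factor);
}
```

The random factor is always a multiple of the page size and never exceeds 0xff000, so with these assumed values the top-down base moves within a window of just under 1MB below 0xB7000000.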

The user space virtual address layout looks like:

32-bit user virtual address space layout

*ARM has an LPAE (Large Physical Address Extension) mode that can address up to 1TB of physical memory.