thinkiii: linux

Sunday, May 18, 2014

A brief introduction to per-cpu variables

Per-cpu variables are widely used in the Linux kernel, for example in per-cpu counters and per-cpu caches. The advantage is obvious: each CPU works on its own copy of the data, so we do not need locks to synchronize with other CPUs, and avoiding locks improves performance.

There are two kinds of per-cpu variables: static and dynamic. Static per-cpu variables are defined at build time. Linux provides the DEFINE_PER_CPU macro to define them.

#define DEFINE_PER_CPU(type, name)

static DEFINE_PER_CPU(struct delayed_work, vmstat_work);

Dynamic per-cpu variables are obtained at run time with the __alloc_percpu API. __alloc_percpu returns the per-cpu address of the variable.

void __percpu *__alloc_percpu(size_t size, size_t align)
s->cpu_slab = __alloc_percpu(sizeof(struct kmem_cache_cpu),2 * sizeof(void *));

One big difference between per-cpu variables and other variables is that we must use the per-cpu accessor macros to reach the real per-cpu variable for a given CPU. Accessing a per-cpu variable without going through these macros is a bug in Linux kernel programming. We will see the reason later.

Here are two examples of accessing per-cpu variables:

struct vm_event_state *this = &per_cpu(vm_event_states, cpu);

struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);

Let's take a closer look at the behaviour of Linux per-cpu variables. After we define our static per-cpu variables, the compiler and linker collect all of them into the per-cpu section (.data..percpu). We can see them with the 'nm' and 'readelf' tools:

0000000000000000 D __per_cpu_start
...
000000000000f1c0 d lru_add_drain_work
000000000000f1e0 D vm_event_states
000000000000f420 d vmstat_work
000000000000f4a0 d vmap_block_queue
000000000000f4c0 d vfree_deferred
000000000000f4f0 d memory_failure_cpu
...
0000000000013ac0 D __per_cpu_end

  [15] .vvar             PROGBITS         ffffffff81698000  00898000
       00000000000000f0  0000000000000000  WA       0     0     16
  [16] .data..percpu     PROGBITS         0000000000000000  00a00000
       0000000000013ac0  0000000000000000  WA       0     0     4096
  [17] .init.text        PROGBITS         ffffffff816ad000  00aad000
       000000000003fa21  0000000000000000  AX       0     0     16

You can see our vmstat_work is at 0xf420, which lies between __per_cpu_start and __per_cpu_end. These two special symbols mark the start and end addresses of the per-cpu section.

One simple question: there is only one entry for vmstat_work in the per-cpu section, but we should have NR_CPUS entries of it. Where are all the other vmstat_work entries?

Actually the per-cpu section is just a roadmap of all per-cpu variables. The real body of every per-cpu variable is allocated in a per-cpu chunk at run time. Linux makes NR_CPUS copies of the static and dynamic variables. To get to those real bodies, we use the per_cpu or per_cpu_ptr macros.

What per_cpu and per_cpu_ptr do is add an offset (taken from __per_cpu_offset) to the given address to reach the real body of the per-cpu variable.

#define per_cpu(var, cpu) \
        (*SHIFT_PERCPU_PTR(&(var), per_cpu_offset(cpu)))

#define per_cpu_offset(x) (__per_cpu_offset[x])
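
The offset arithmetic can be tried out in user space. The following is only a minimal sketch, not kernel code: sim_counter, sim_offset, sim_per_cpu_ptr and sim_setup_per_cpu are hypothetical names, malloc stands in for the per-cpu chunk, and the copies are never freed.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SIM_NR_CPUS 4   /* hypothetical CPU count for this sketch */

/* The single copy that would live in the per-cpu section (the "roadmap"). */
long sim_counter = 100;

/* One offset per CPU, playing the role of __per_cpu_offset[]. */
uintptr_t sim_offset[SIM_NR_CPUS];

/* Like per_cpu_ptr(): section address + this CPU's offset = real body. */
long *sim_per_cpu_ptr(long *section_addr, int cpu)
{
        return (long *)((uintptr_t)section_addr + sim_offset[cpu]);
}

/*
 * Give every CPU a private copy initialised from the section copy and
 * record how far that copy is from the section address.
 * Returns 0 on success, -1 if an allocation fails.
 */
int sim_setup_per_cpu(void)
{
        for (int cpu = 0; cpu < SIM_NR_CPUS; cpu++) {
                long *body = malloc(sizeof(*body));

                if (!body)
                        return -1;
                memcpy(body, &sim_counter, sizeof(*body));
                sim_offset[cpu] = (uintptr_t)body - (uintptr_t)&sim_counter;
        }
        return 0;
}
```

After sim_setup_per_cpu(), writing through sim_per_cpu_ptr(&sim_counter, 1) changes only CPU 1's copy, while a plain access to sim_counter touches the section copy that no CPU actually uses. That is the reason direct access is a bug.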

It's easier to understand the idea with a picture:

Translating a per-cpu variable to its real body (NR_CPUS = 4)

Take a closer look:
There are three parts in a unit: static, reserved, and dynamic.
static: the static per-cpu variables (__per_cpu_end - __per_cpu_start)
reserved: per-cpu slots reserved for kernel modules
dynamic: slots for dynamic allocation (__alloc_percpu)

Unit and chunk

static struct pcpu_alloc_info * __init pcpu_build_alloc_info(
                                size_t reserved_size, size_t dyn_size,
                                size_t atom_size,
                                pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
{
        static int group_map[NR_CPUS] __initdata;
        static int group_cnt[NR_CPUS] __initdata;
        const size_t static_size = __per_cpu_end - __per_cpu_start;
+-- 12 lines: int nr_groups = 1, nr_units = 0;----------------------
        /* calculate size_sum and ensure dyn_size is enough for early alloc */
        size_sum = PFN_ALIGN(static_size + reserved_size +
                            max_t(size_t, dyn_size, PERCPU_DYNAMIC_EARLY_SIZE));
        dyn_size = size_sum - static_size - reserved_size;
+--108 lines: Determine min_unit_size, alloc_size and max_upa such that--
}
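
The size calculation in the excerpt above can be replayed in user space. This is a sketch under assumptions: the names are hypothetical, the page size is taken as 4096 bytes, and the value used for SIM_DYNAMIC_EARLY_SIZE is an assumed stand-in for PERCPU_DYNAMIC_EARLY_SIZE.

```c
#include <stddef.h>

#define SIM_PAGE_SIZE          4096UL
#define SIM_DYNAMIC_EARLY_SIZE (12UL << 10)  /* assumed stand-in value */

struct sim_unit_layout {
        size_t static_size;
        size_t reserved_size;
        size_t dyn_size;
        size_t size_sum;
};

/* Round n up to the next multiple of the page size (cf. PFN_ALIGN). */
size_t sim_page_align(size_t n)
{
        return (n + SIM_PAGE_SIZE - 1) & ~(SIM_PAGE_SIZE - 1);
}

/*
 * Mirror the two statements in the excerpt: page-align the sum of the three
 * parts, then let the dynamic part absorb the alignment padding.
 */
struct sim_unit_layout sim_build_unit_layout(size_t static_size,
                                             size_t reserved_size,
                                             size_t dyn_size)
{
        struct sim_unit_layout u;

        if (dyn_size < SIM_DYNAMIC_EARLY_SIZE)
                dyn_size = SIM_DYNAMIC_EARLY_SIZE;

        u.static_size = static_size;
        u.reserved_size = reserved_size;
        u.size_sum = sim_page_align(static_size + reserved_size + dyn_size);
        u.dyn_size = u.size_sum - static_size - reserved_size;
        return u;
}
```

With the static size 0x13ac0 from the readelf listing and hypothetical reserved and dynamic sizes of 8192 and 20480 bytes, size_sum comes out as 110592 bytes (27 pages) and the dynamic part grows to 21824 bytes.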

After determining the size of the unit, the chunk is allocated by the memblock APIs.

int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
                                  size_t atom_size,
                                  pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
                                  pcpu_fc_alloc_fn_t alloc_fn,
                                  pcpu_fc_free_fn_t free_fn)
{
+-- 20 lines: void *base = (void *)ULONG_MAX;---------------------------------
        /* allocate, copy and determine base address */
        for (group = 0; group < ai->nr_groups; group++) {
                struct pcpu_group_info *gi = &ai->groups[group];
                unsigned int cpu = NR_CPUS;
                void *ptr;

                for (i = 0; i < gi->nr_units && cpu == NR_CPUS; i++)
                        cpu = gi->cpu_map[i];
                BUG_ON(cpu == NR_CPUS);

                /* allocate space for the whole group */
                ptr = alloc_fn(cpu, gi->nr_units * ai->unit_size, atom_size);
                if (!ptr) {
                        rc = -ENOMEM;
                        goto out_free_areas;
                }
                /* kmemleak tracks the percpu allocations separately */
                kmemleak_free(ptr);
                areas[group] = ptr;

                base = min(ptr, base);
        }
+-- 60 lines: Copy data and free unused parts.  This should happen after all---
}

static void * __init pcpu_dfl_fc_alloc(unsigned int cpu, size_t size,
                                       size_t align)
{
        return  memblock_virt_alloc_from_nopanic(
                        size, align, __pa(MAX_DMA_ADDRESS));
}

Tuesday, April 8, 2014

printk as a debug tool

printk is a natural and basic tool for debugging the kernel. Sometimes it is the only tool we have. Here are some tips for using printk.

1) printk formats

Documentation/printk-formats.txt describes many useful printk formats. I use the %p family the most:

Raw pointer value SHOULD be printed with %p. The kernel supports
the following extended format specifiers for pointer types:

Symbols/Function Pointers:

        %pF     versatile_init+0x0/0x110
        %pf     versatile_init
        %pS     versatile_init+0x0/0x110
        %pSR    versatile_init+0x9/0x110
                (with __builtin_extract_return_addr() translation)
        %ps     versatile_init
        %pB     prev_fn_of_versatile_init+0x88/0x88

2) print_hex_dump

Sometimes I have to create memory dumps. You can use a simple for loop to do that, but the Linux kernel provides a better way: print_hex_dump.

Function prototype:

static inline void print_hex_dump(const char *level, const char *prefix_str,
                                  int prefix_type, int rowsize, int groupsize,
                                  const void *buf, size_t len, bool ascii)

For example:

print_hex_dump(KERN_ALERT, "mem: ", DUMP_PREFIX_ADDRESS,
               16, 1, p, 512, 1);

output:

mem: ddc86680: 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12  ................
mem: ddc86690: 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12  ................
mem: ddc866a0: 12 12 12 12 60 6b c8 dd 16 80 99 e0 fa 8e 2a c1  ....`k........*.
mem: ddc866b0: 16 80 99 e0 ce 92 2a c1 16 80 99 e0 f2 c1 1b c1  ......*.........
mem: ddc866c0: 16 80 99 e0 4c 8b 0a c1 4c 8b 0a c1 61 80 99 e0  ....L...L...a...
mem: ddc866d0: 16 80 99 e0 61 80 99 e0 16 80 99 e0 61 80 99 e0  ....a.......a...
mem: ddc866e0: 75 80 99 e0 48 01 00 c1 2b 36 05 c1 00 00 00 00  u...H...+6......
mem: ddc866f0: 4a 0c 00 00 99 ad 06 00 6d 35 05 c1 9e 8b 2a c1  J.......m5....*.
mem: ddc86700: 6d 35 05 c1 48 8c 2a c1 6d 35 05 c1 ee 89 0a c1  m5..H.*.m5......
mem: ddc86710: ee 89 0a c1 e4 0a 14 c1 e4 0a 14 c1 ee 89 0a c1  ................
mem: ddc86720: ee 89 0a c1 6d 35 05 c1 6d 35 05 c1 6d 35 05 c1  ....m5..m5..m5..
mem: ddc86730: a7 39 05 c1 ef b8 2a c1 00 00 00 00 00 00 00 00  .9....*.........
mem: ddc86740: 4a 0c 00 00 97 ad 06 00 5a 5a 5a 5a 5a 5a 5a 5a  J.......ZZZZZZZZ
mem: ddc86750: 14 dc 46 dd 14 dc 46 dd 00 00 00 00 6b 6b 6b 6b  ..F...F.....kkkk
mem: ddc86760: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.
mem: ddc86770: cc cc cc cc c0 69 c8 dd a0 83 20 c1 fa 8e 2a c1  .....i.... ...*.
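
To see how each row of the dump above is laid out (prefix, address, 16 bytes in hex, then the same bytes as ASCII), here is a user-space sketch that formats one row. It is not the kernel implementation: sim_hex_dump_row is a hypothetical helper, and it handles only the rowsize 16, groupsize 1, ascii case.

```c
#include <stddef.h>
#include <stdio.h>

#define SIM_ROWSIZE 16

/*
 * Format one row as "<prefix><addr>: xx xx ... xx  <ascii>".
 * Returns the number of characters written (without the trailing NUL),
 * or -1 if the output buffer is too small.
 */
int sim_hex_dump_row(char *out, size_t outlen, const char *prefix,
                     unsigned long addr, const unsigned char *buf, size_t len)
{
        static const char hex[] = "0123456789abcdef";
        char *p;
        size_t i;
        int n;

        if (len > SIM_ROWSIZE)
                len = SIM_ROWSIZE;

        n = snprintf(out, outlen, "%s%08lx:", prefix, addr);
        if (n < 0)
                return -1;
        /* header + 3 chars per hex column + 2 spaces + ascii column + NUL */
        if ((size_t)n + 3 * SIM_ROWSIZE + 2 + len + 1 > outlen)
                return -1;

        p = out + n;
        for (i = 0; i < SIM_ROWSIZE; i++) {
                *p++ = ' ';
                *p++ = i < len ? hex[buf[i] >> 4] : ' ';
                *p++ = i < len ? hex[buf[i] & 0xf] : ' ';
        }
        *p++ = ' ';
        *p++ = ' ';
        /* non-printable bytes are shown as '.' in the ASCII column */
        for (i = 0; i < len; i++)
                *p++ = (buf[i] >= 0x20 && buf[i] < 0x7f) ? (char)buf[i] : '.';
        *p = '\0';
        return (int)(p - out);
}
```

Feeding it the 16 bytes of the third row of the dump, with the prefix "mem: " and the address 0xddc866a0, reproduces that row character for character.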

3) pr_alert family

The kernel provides wrapper macros for the different printk levels. I prefer the macros because they are easier to read and take fewer characters to type.

#define pr_emerg(fmt, ...) \
        printk(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
#define pr_alert(fmt, ...) \
        printk(KERN_ALERT pr_fmt(fmt), ##__VA_ARGS__)
#define pr_crit(fmt, ...) \
        printk(KERN_CRIT pr_fmt(fmt), ##__VA_ARGS__)
#define pr_err(fmt, ...) \
        printk(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
#define pr_warning(fmt, ...) \
        printk(KERN_WARNING pr_fmt(fmt), ##__VA_ARGS__)
#define pr_warn pr_warning
#define pr_notice(fmt, ...) \
        printk(KERN_NOTICE pr_fmt(fmt), ##__VA_ARGS__)
#define pr_info(fmt, ...) \
        printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
#define pr_cont(fmt, ...) \
        printk(KERN_CONT fmt, ##__VA_ARGS__)
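
The pr_fmt() hook in these macros lets a source file prepend its own prefix to every message. The sketch below imitates the mechanism in user space: all sim_ names are hypothetical, snprintf into a buffer stands in for printk, the "<1>" and "<6>" level strings are only illustrative, and the ##__VA_ARGS__ form relies on the GNU C extension that the kernel macros also use.

```c
#include <stdio.h>

/* Illustrative level prefixes; a real kernel may encode them differently. */
#define SIM_KERN_ALERT "<1>"
#define SIM_KERN_INFO  "<6>"

/* Per-file message prefix, glued on by string-literal concatenation. */
#define sim_pr_fmt(fmt) "mymodule: " fmt

/* Stand-in for the kernel log: the last formatted message. */
char sim_log[256];

#define sim_printk(...) \
        snprintf(sim_log, sizeof(sim_log), __VA_ARGS__)

#define sim_pr_alert(fmt, ...) \
        sim_printk(SIM_KERN_ALERT sim_pr_fmt(fmt), ##__VA_ARGS__)
#define sim_pr_info(fmt, ...) \
        sim_printk(SIM_KERN_INFO sim_pr_fmt(fmt), ##__VA_ARGS__)
```

A call such as sim_pr_alert("p: %d\n", 7) leaves "<1>mymodule: p: 7\n" in sim_log. The level and the prefix are joined to the format string at compile time, so the wrapper costs nothing at run time.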

Thursday, February 20, 2014

debug with Linux slub allocator

The slub allocator in Linux has useful debug features, such as poisoning, redzone checking, and allocate/free traces with timestamps. They are very useful during the product development stage. Let's create a kernel module and test the debug features.

Make sure the slub allocator and its debug support are built into your kernel.

CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y

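The slub allocator creates additional metadata to store allocate/free traces and timestamps. Every time the slub allocator allocates or frees an object, it performs a poison check (data area) and a redzone check (boundary). These checks run only when debugging is enabled for the cache, for example by booting with the slub_debug kernel parameter or building with CONFIG_SLUB_DEBUG_ON=y.

The module below shows how it happens. It allocates 32 bytes from the kernel and then overwrites the redzone by calling memset on 36 bytes.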

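#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/printk.h>
#include <linux/slab.h>
#include <linux/string.h>

/* Allocate a 32-byte object and deliberately write 4 bytes past its end. */
void try_to_corrupt_redzone(void)
{
        void *p = kmalloc(32, GFP_KERNEL);

        if (p) {
                pr_alert("p: 0x%p\n", p);
                memset(p, 0x12, 36);    /* write too much */
                /* dump 512 bytes to show the object and its neighbours */
                print_hex_dump(KERN_ALERT, "mem: ", DUMP_PREFIX_ADDRESS,
                                16, 1, p, 512, 1);
                kfree(p);       /* slub.c should catch this error */
        }
}

static int __init mymodule_init(void)
{
        pr_alert("%s init\n", __func__);
        try_to_corrupt_redzone();
        return 0;
}

static void __exit mymodule_exit(void)
{
        pr_alert("%s exit\n", __func__);
}

module_init(mymodule_init);
module_exit(mymodule_exit);
MODULE_LICENSE("GPL");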

When the object is freed, the kernel checks it, finds that the redzone has been overwritten, and reports:

[ 2050.630002] mymodule_init init
[ 2050.630565] p: 0xddc86680
[ 2050.630653] mem: ddc86680: 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12  ................
[ 2050.630779] mem: ddc86690: 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12  ................
[ 2050.630897] mem: ddc866a0: 12 12 12 12 60 6b c8 dd 16 80 99 e0 fa 8e 2a c1  ....`k........*.
[ 2050.631014] mem: ddc866b0: 16 80 99 e0 ce 92 2a c1 16 80 99 e0 f2 c1 1b c1  ......*.........
[ 2050.631130] mem: ddc866c0: 16 80 99 e0 4c 8b 0a c1 4c 8b 0a c1 61 80 99 e0  ....L...L...a...
[ 2050.631248] mem: ddc866d0: 16 80 99 e0 61 80 99 e0 16 80 99 e0 61 80 99 e0  ....a.......a...
[ 2050.631365] mem: ddc866e0: 75 80 99 e0 48 01 00 c1 2b 36 05 c1 00 00 00 00  u...H...+6......
[ 2050.631483] mem: ddc866f0: 4a 0c 00 00 99 ad 06 00 6d 35 05 c1 9e 8b 2a c1  J.......m5....*.
[ 2050.631599] mem: ddc86700: 6d 35 05 c1 48 8c 2a c1 6d 35 05 c1 ee 89 0a c1  m5..H.*.m5......
[ 2050.631716] mem: ddc86710: ee 89 0a c1 e4 0a 14 c1 e4 0a 14 c1 ee 89 0a c1  ................
[ 2050.631832] mem: ddc86720: ee 89 0a c1 6d 35 05 c1 6d 35 05 c1 6d 35 05 c1  ....m5..m5..m5..
[ 2050.631948] mem: ddc86730: a7 39 05 c1 ef b8 2a c1 00 00 00 00 00 00 00 00  .9....*.........
[ 2050.633948] mem: ddc86740: 4a 0c 00 00 97 ad 06 00 5a 5a 5a 5a 5a 5a 5a 5a  J.......ZZZZZZZZ
[ 2050.634095] mem: ddc86750: 14 dc 46 dd 14 dc 46 dd 00 00 00 00 6b 6b 6b 6b  ..F...F.....kkkk
[ 2050.634236] mem: ddc86760: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.
[ 2050.634378] mem: ddc86770: cc cc cc cc c0 69 c8 dd a0 83 20 c1 fa 8e 2a c1  .....i.... ...*.
[ 2050.634629] =============================================================================
[ 2050.634750] BUG kmalloc-32 (Tainted: P    B      O): Redzone overwritten
[ 2050.634828] -----------------------------------------------------------------------------
[ 2050.634828] 
[ 2050.634967] INFO: 0xddc866a0-0xddc866a3. First byte 0x12 instead of 0xcc
[ 2050.635123] INFO: Allocated in try_to_corrupt_redzone+0x16/0x61 [mymodule] age=1 cpu=0 pid=3146
[ 2050.635255]  alloc_debug_processing+0x63/0xd1
[ 2050.635337]  try_to_corrupt_redzone+0x16/0x61 [mymodule]
[ 2050.635423]  __slab_alloc.constprop.73+0x366/0x384
[ 2050.635506]  try_to_corrupt_redzone+0x16/0x61 [mymodule]
[ 2050.635594]  vt_console_print+0x21e/0x226
[ 2050.635672]  try_to_corrupt_redzone+0x16/0x61 [mymodule]
[ 2050.635758]  kmem_cache_alloc_trace+0x43/0xd7
[ 2050.635832]  kmem_cache_alloc_trace+0x43/0xd7
[ 2050.635909]  mymodule_init+0x0/0x19 [mymodule]
[ 2050.635992]  try_to_corrupt_redzone+0x16/0x61 [mymodule]
[ 2050.636003]  mymodule_init+0x0/0x19 [mymodule]
[ 2050.636092]  try_to_corrupt_redzone+0x16/0x61 [mymodule]
[ 2050.636179]  mymodule_init+0x0/0x19 [mymodule]
[ 2050.636261]  mymodule_init+0x14/0x19 [mymodule]
[ 2050.636343]  do_one_initcall+0x6c/0xf4
[ 2050.636428]  load_module+0x1690/0x199a
[ 2050.636508] INFO: Freed in load_module+0x15d2/0x199a age=3 cpu=0 pid=3146
[ 2050.636598]  free_debug_processing+0xd6/0x142
[ 2050.636676]  load_module+0x15d2/0x199a
[ 2050.636749]  __slab_free+0x3e/0x28d
[ 2050.636819]  load_module+0x15d2/0x199a
[ 2050.636888]  kfree+0xe4/0x102
[ 2050.636953]  kfree+0xe4/0x102
[ 2050.637020]  kobject_uevent_env+0x361/0x39a
[ 2050.637091]  kobject_uevent_env+0x361/0x39a
[ 2050.637163]  kfree+0xe4/0x102
[ 2050.637227]  kfree+0xe4/0x102
[ 2050.637294]  load_module+0x15d2/0x199a
[ 2050.637366]  load_module+0x15d2/0x199a
[ 2050.637438]  load_module+0x15d2/0x199a
[ 2050.637509]  SyS_init_module+0x72/0x8a
[ 2050.637581]  syscall_call+0x7/0xb
[ 2050.637649] INFO: Slab 0xdffa90c0 objects=19 used=8 fp=0xddc86000 flags=0x40000080
[ 2050.637749] INFO: Object 0xddc86680 @offset=1664 fp=0xddc86b60
[ 2050.637749] 
[ 2050.637875] Bytes b4 ddc86670: 14 01 00 00 95 ad 06 00 5a 5a 5a 5a 5a 5a 5a 5a  ........ZZZZZZZZ
[ 2050.637875] Object ddc86680: 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12  ................
[ 2050.637875] Object ddc86690: 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12  ................
[ 2050.637875] Redzone ddc866a0: 12 12 12 12                                      ....
[ 2050.637875] Padding ddc86748: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ
[ 2050.637875] CPU: 0 PID: 3146 Comm: insmod Tainted: P    B      O 3.10.17 #1
[ 2050.637875] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 2050.637875]  00000000 c10a7b59 c10941c5 dffa90c0 ddc86680 de8012cc de801280 ddc86680
[ 2050.637875]  dffa90c0 c10a7bd3 c13689a5 ddc866a0 000000cc 00000004 de801280 ddc86680
[ 2050.637875]  dffa90c0 de800e00 c12a8b2f 000000cc ddc86680 de801280 dffa90c0 dd407e50
[ 2050.637875] Call Trace:
[ 2050.637875]  [<c10a7b59>] ? check_bytes_and_report+0x6d/0xb0
[ 2050.637875]  [<c10941c5>] ? page_address+0x1a/0x79
[ 2050.637875]  [<c10a7bd3>] ? check_object+0x37/0x149
[ 2050.637875]  [<c12a8b2f>] ? free_debug_processing+0x67/0x142
[ 2050.637875]  [<c12a8c48>] ? __slab_free+0x3e/0x28d
[ 2050.637875]  [<e0998075>] ? mymodule_init+0x14/0x19 [mymodule]
[ 2050.637875]  [<c102063d>] ? wake_up_klogd+0x1d/0x1e
[ 2050.637875]  [<c10a89ee>] ? kfree+0xe4/0x102
[ 2050.637875]  [<c10a89ee>] ? kfree+0xe4/0x102
[ 2050.637875]  [<e0998075>] ? mymodule_init+0x14/0x19 [mymodule]
[ 2050.637875]  [<e0998075>] ? mymodule_init+0x14/0x19 [mymodule]
[ 2050.637875]  [<e0998061>] ? try_to_corrupt_redzone+0x61/0x61 [mymodule]
[ 2050.637875]  [<e0998075>] ? mymodule_init+0x14/0x19 [mymodule]
[ 2050.637875]  [<c1000148>] ? do_one_initcall+0x6c/0xf4
[ 2050.637875]  [<c105362b>] ? load_module+0x1690/0x199a
[ 2050.637875]  [<c10539a7>] ? SyS_init_module+0x72/0x8a
[ 2050.637875]  [<c12ab8ef>] ? syscall_call+0x7/0xb
[ 2050.637875] FIX kmalloc-32: Restoring 0xddc866a0-0xddc866a3=0xcc
[ 2050.637875] 
[ 2051.232817] mymodule_exit exit

First the slub allocator prints the error type, "Redzone overwritten":

[ 2050.634629] =============================================================================
[ 2050.634750] BUG kmalloc-32 (Tainted: P    B      O): Redzone overwritten
[ 2050.634828] -----------------------------------------------------------------------------
[ 2050.634828] 
[ 2050.634967] INFO: 0xddc866a0-0xddc866a3. First byte 0x12 instead of 0xcc

To understand what the redzone is, take a look at the memory content around the object:

[ 2050.637875] Bytes b4 ddc86670: 14 01 00 00 95 ad 06 00 5a 5a 5a 5a 5a 5a 5a 5a  ........ZZZZZZZZ
[ 2050.637875] Object ddc86680: 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12  ................
[ 2050.637875] Object ddc86690: 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12  ................
[ 2050.637875] Redzone ddc866a0: 12 12 12 12                                      ....
[ 2050.637875] Padding ddc86748: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ

We write 36 bytes of 0x12 starting at the 32-byte object (0xddc86680 - 0xddc8669f), so the last 4 bytes land on the redzone (normally 0xbb or 0xcc). When the object is returned to the kernel, the kernel finds that the redzone is neither 0xcc nor 0xbb and reports this as a BUG.
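
The check itself is simple enough to imitate in user space. The following is a sketch, not the slub code: the sim_ names are hypothetical, the redzone is fixed at 4 bytes as in the report above, and only the 0xcc (object in use) pattern is modelled.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SIM_OBJ_SIZE     32
#define SIM_REDZONE_SIZE 4
#define SIM_RED_ACTIVE   0xcc   /* redzone pattern while the object is in use */

/* Allocate a 32-byte object followed by a redzone filled with 0xcc. */
unsigned char *sim_alloc_object(void)
{
        unsigned char *p = malloc(SIM_OBJ_SIZE + SIM_REDZONE_SIZE);

        if (p)
                memset(p + SIM_OBJ_SIZE, SIM_RED_ACTIVE, SIM_REDZONE_SIZE);
        return p;
}

/*
 * Scan the redzone as would be done at free time. Returns the offset (from
 * the start of the object) of the first damaged byte, or -1 if it is intact.
 */
long sim_check_redzone(const unsigned char *obj)
{
        for (size_t i = 0; i < SIM_REDZONE_SIZE; i++) {
                if (obj[SIM_OBJ_SIZE + i] != SIM_RED_ACTIVE)
                        return (long)(SIM_OBJ_SIZE + i);
        }
        return -1;
}
```

A 32-byte memset leaves the redzone intact, while a 36-byte memset makes sim_check_redzone() report offset 32. That offset corresponds to 0xddc866a0 in the report: the object address 0xddc86680 plus 0x20.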

The slub allocator reports the latest allocate/free history of this object. You can see the object was just allocated by our kernel module function 'try_to_corrupt_redzone'.

Sometimes the traces of the object are more useful than the function backtrace. Take a use-after-free case: function A allocates an object, frees it, and then writes to it after the free. Meanwhile the object is allocated again by another function B. Function B now holds a corrupted object, and if we have the free trace of this object, we can trace back to its previous owner, function A.

[ 2050.635123] INFO: Allocated in try_to_corrupt_redzone+0x16/0x61 [mymodule] age=1 cpu=0 pid=3146
[ 2050.635255]  alloc_debug_processing+0x63/0xd1
[ 2050.635337]  try_to_corrupt_redzone+0x16/0x61 [mymodule]
[ 2050.635423]  __slab_alloc.constprop.73+0x366/0x384
[ 2050.635506]  try_to_corrupt_redzone+0x16/0x61 [mymodule]
[ 2050.635594]  vt_console_print+0x21e/0x226
[ 2050.635672]  try_to_corrupt_redzone+0x16/0x61 [mymodule]
[ 2050.635758]  kmem_cache_alloc_trace+0x43/0xd7
[ 2050.635832]  kmem_cache_alloc_trace+0x43/0xd7
[ 2050.635909]  mymodule_init+0x0/0x19 [mymodule]
[ 2050.635992]  try_to_corrupt_redzone+0x16/0x61 [mymodule]
[ 2050.636003]  mymodule_init+0x0/0x19 [mymodule]
[ 2050.636092]  try_to_corrupt_redzone+0x16/0x61 [mymodule]
[ 2050.636179]  mymodule_init+0x0/0x19 [mymodule]
[ 2050.636261]  mymodule_init+0x14/0x19 [mymodule]
[ 2050.636343]  do_one_initcall+0x6c/0xf4
[ 2050.636428]  load_module+0x1690/0x199a
[ 2050.636508] INFO: Freed in load_module+0x15d2/0x199a age=3 cpu=0 pid=3146
[ 2050.636598]  free_debug_processing+0xd6/0x142
[ 2050.636676]  load_module+0x15d2/0x199a
[ 2050.636749]  __slab_free+0x3e/0x28d
[ 2050.636819]  load_module+0x15d2/0x199a
[ 2050.636888]  kfree+0xe4/0x102
[ 2050.636953]  kfree+0xe4/0x102
[ 2050.637020]  kobject_uevent_env+0x361/0x39a
[ 2050.637091]  kobject_uevent_env+0x361/0x39a
[ 2050.637163]  kfree+0xe4/0x102
[ 2050.637227]  kfree+0xe4/0x102
[ 2050.637294]  load_module+0x15d2/0x199a
[ 2050.637366]  load_module+0x15d2/0x199a
[ 2050.637438]  load_module+0x15d2/0x199a
[ 2050.637509]  SyS_init_module+0x72/0x8a
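
The use-after-free scenario described above can be modelled with a tiny user-space sketch. It only illustrates the idea of keeping an allocate/free track next to each object. The sim_ names are hypothetical, the track is reduced to a caller name, and the whole data area is poisoned with 0x6b, which is a simplification of what the dump shows.

```c
#include <stddef.h>
#include <string.h>

#define SIM_OBJ_SIZE    32
#define SIM_POISON_FREE 0x6b    /* byte written over a freed object */

struct sim_object {
        unsigned char data[SIM_OBJ_SIZE];
        const char *alloc_caller;       /* cf. "INFO: Allocated in ..." */
        const char *free_caller;        /* cf. "INFO: Freed in ..." */
};

/* Free: remember who freed the object and poison its data area. */
void sim_obj_free(struct sim_object *obj, const char *caller)
{
        obj->free_caller = caller;
        memset(obj->data, SIM_POISON_FREE, SIM_OBJ_SIZE);
}

/*
 * Allocate: verify the poison before handing the object out again.
 * Returns NULL if the poison is intact; otherwise returns the caller
 * recorded at free time, i.e. the previous owner to investigate.
 */
const char *sim_obj_alloc(struct sim_object *obj, const char *caller)
{
        const char *suspect = NULL;

        for (size_t i = 0; i < SIM_OBJ_SIZE; i++) {
                if (obj->data[i] != SIM_POISON_FREE) {
                        suspect = obj->free_caller;
                        break;
                }
        }
        obj->alloc_caller = caller;
        return suspect;
}
```

If "function_A" frees the object and then writes one byte into it, the next sim_obj_alloc() on behalf of "function_B" returns "function_A", which is the same conclusion we would draw from the "Freed in" trace.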

Saturday, February 15, 2014

ARM64 Linux kernel virtual address space

Now let's talk about the Linux kernel virtual address space on 64-bit ARM CPUs. You can find information about ARMv8 on the official ARM website: http://www.arm.com/products/processors/armv8-architecture.php

One big problem on 32-bit CPUs is the 4GB limit on virtual address space. The problem remains even with PAE support, since PAE extends the physical address space, not the virtual address space. Things changed with the arrival of 64-bit CPUs such as AMD64 and ARMv8: they can support up to 2^64 addresses, which is uhh.. a very big number.

Actually 2^64 is too large, so the Linux kernel implementation uses only part of the 64 bits (42 bits with CONFIG_ARM64_64K_PAGES, 39 bits with 4K pages). This article assumes 4K pages (the VA_BITS = 39 case).

#ifdef CONFIG_ARM64_64K_PAGES
#define VA_BITS                 (42)
#else
#define VA_BITS                 (39)
#endif

One good thing on ARM64 is that since we have enough virtual address bits, user space and kernel space can each have their own 2^39 = 512GB of virtual address space!

All user virtual addresses have 25 leading zeros and all kernel addresses have 25 leading ones. The addresses between user space and kernel space are not used, and any access to them traps as an illegal access.
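
The leading zeros and leading ones rule is easy to check with a few lines of user-space C. This is just a sketch for the VA_BITS = 39 case. The sim_ names are hypothetical, and sim_page_offset() simply evaluates the PAGE_OFFSET expression from arch/arm64/include/asm/memory.h quoted in this article.

```c
#include <stdint.h>

#define SIM_VA_BITS 39

enum sim_va_kind { SIM_VA_USER, SIM_VA_KERNEL, SIM_VA_INVALID };

/* Classify a 64-bit virtual address by its upper (64 - VA_BITS) = 25 bits. */
enum sim_va_kind sim_classify_va(uint64_t va)
{
        uint64_t top = va >> SIM_VA_BITS;
        uint64_t all_ones = (UINT64_C(1) << (64 - SIM_VA_BITS)) - 1;

        if (top == 0)
                return SIM_VA_USER;     /* 25 leading zeros */
        if (top == all_ones)
                return SIM_VA_KERNEL;   /* 25 leading ones */
        return SIM_VA_INVALID;          /* the unused hole in between */
}

/* PAGE_OFFSET = 0xffffffffffffffff << (VA_BITS - 1) */
uint64_t sim_page_offset(void)
{
        return UINT64_MAX << (SIM_VA_BITS - 1);
}
```

For VA_BITS = 39 the last user address is 0x0000007fffffffff, the first kernel address is 0xffffff8000000000, and sim_page_offset() evaluates to 0xffffffc000000000, which is the upper half of the 512GB kernel range.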

ARM64 Linux virtual address space layout

kernel space:

Although we have no ARM64 environment now, we can analyze the kernel virtual address space by reading the source code and observing a running AMD64 Linux box.

In arch/arm64/include/asm/memory.h, we can see some differences. First, there is no separate lowmem zone: the virtual address space is so big that we can treat all memory as lowmem and do not have to worry about running out of virtual addresses. (Yes, there is still a limit on kernel virtual addresses.) Second, the order of the kernel virtual address regions changes:

#ifdef CONFIG_ARM64_64K_PAGES
#define VA_BITS                 (42)
#else                               
#define VA_BITS                 (39)
#endif                              
#define PAGE_OFFSET             (UL(0xffffffffffffffff) << (VA_BITS - 1))
#define MODULES_END             (PAGE_OFFSET)
#define MODULES_VADDR           (MODULES_END - SZ_64M)
#define EARLYCON_IOBASE         (MODULES_VADDR - SZ_4M)

         pr_notice("Virtual kernel memory layout:\n"                             
                   "    vmalloc : 0x%16lx - 0x%16lx   (%6ld MB)\n"
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
                   "    vmemmap : 0x%16lx - 0x%16lx   (%6ld MB)\n"
 #endif
                   "    modules : 0x%16lx - 0x%16lx   (%6ld MB)\n"
                   "    memory  : 0x%16lx - 0x%16lx   (%6ld MB)\n"
                   "      .init : 0x%p" " - 0x%p" "   (%6ld kB)\n"
                   "      .text : 0x%p" " - 0x%p" "   (%6ld kB)\n"
                   "      .data : 0x%p" " - 0x%p" "   (%6ld kB)\n",
                   MLM(VMALLOC_START, VMALLOC_END),
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
                   MLM((unsigned long)virt_to_page(PAGE_OFFSET),
                       (unsigned long)virt_to_page(high_memory)),
 #endif
                   MLM(MODULES_VADDR, MODULES_END),
                   MLM(PAGE_OFFSET, (unsigned long)high_memory),

                   MLK_ROUNDUP(__init_begin, __init_end),
                   MLK_ROUNDUP(_text, _etext),
                   MLK_ROUNDUP(_sdata, _edata));

see also:
arch/arm64/mm/init.c
arch/arm64/include/asm/pgtable.h

You can see that there is no pkmap or fixmap. This is because the kernel assumes all memory has a valid kernel virtual address, so there is no need to create pkmap/fixmap mappings.

ARM64 kernel virtual address space layout

User space:

The memory layout of the user virtual address space looks like it does on ARM32. Since the available user virtual address space becomes 512GB, we can build larger applications on 64-bit CPUs.

One interesting topic is that ARM claims ARMv8 is compatible with ARM 32-bit applications: all 32-bit applications can run on ARMv8 without modification. What does the virtual memory layout of a 32-bit application look like on a 64-bit kernel?

Actually, every process on a 64-bit kernel is a 64-bit process. To run an ARM 32-bit application, the Linux kernel still creates the process from a 64-bit init process, but limits its user address space to 4GB. In this way, we can have both 32-bit and 64-bit applications on a 64-bit Linux kernel.

 #ifdef CONFIG_COMPAT
 #define TASK_SIZE_32            UL(0x100000000)
 #define TASK_SIZE               (test_thread_flag(TIF_32BIT) ? \
                                 TASK_SIZE_32 : TASK_SIZE_64)
 #else
 #define TASK_SIZE               TASK_SIZE_64
 #endif /* CONFIG_COMPAT */
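
The effect of the TASK_SIZE macro above can be written as a plain function. In this sketch the names are hypothetical, the TIF_32BIT thread flag becomes a boolean argument, and the definition of SIM_TASK_SIZE_64 as 1 << VA_BITS is an assumption taken from the 512GB figure in this article rather than from the excerpt.

```c
#include <stdbool.h>
#include <stdint.h>

#define SIM_VA_BITS      39
#define SIM_TASK_SIZE_32 UINT64_C(0x100000000)          /* from the excerpt */
#define SIM_TASK_SIZE_64 (UINT64_C(1) << SIM_VA_BITS)   /* assumed: 2^39 */

/* Size of the user address space for a 32-bit or 64-bit task. */
uint64_t sim_task_size(bool is_32bit_task)
{
        return is_32bit_task ? SIM_TASK_SIZE_32 : SIM_TASK_SIZE_64;
}

/* Convert a byte count to whole gigabytes. */
uint64_t sim_bytes_to_gb(uint64_t bytes)
{
        return bytes >> 30;
}
```

sim_bytes_to_gb(sim_task_size(true)) gives 4 and sim_bytes_to_gb(sim_task_size(false)) gives 512, the two user-space sizes this article quotes for ARM64.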

64-bit ARM applications on 64-bit Linux kernel

ARM64 64-bit user space program virtual address space layout

32-bit ARM applications on 64-bit Linux kernel

ARM64 32-bit user space program virtual address space layout

Note that a 32-bit application still has a 512GB kernel virtual address space and does not share its own 4GB of virtual address space with the kernel, so the user application has a complete 4GB of virtual address space. On the other hand, 32-bit applications on a 32-bit kernel have only 3GB of virtual address space.

                                          ARM32 Linux   ARM64 Linux
32-bit user virtual address space size    3GB           4GB
64-bit user virtual address space size    N/A           512GB
kernel virtual address space              1GB           512GB

ARM32 Linux kernel virtual address space

A 32-bit ARM CPU can address up to 2^32 = 4GB*. That is not big enough these days, since the amount of DRAM on computing devices is growing fast and the memory usage of applications is growing as well.

In the Linux kernel implementation, user space and kernel space must coexist in the same 4GB virtual address space. This means each of them gets less than 4GB of virtual address space.
The Linux kernel provides 3 different splits of the virtual address space: VMSPLIT_3G, VMSPLIT_2G, VMSPLIT_1G.

Linux virtual address space options

The default configuration is VMSPLIT_3G. As you can see, kernel space spans 0xC0000000 to 0xFFFFFFFF and user space spans 0x00000000 to 0xBFFFFFFF.

Let's take a closer look at the VMSPLIT_3G mapping:

kernel space

We can observe the kernel virtual address layout by checking the boot log (dmesg) or by looking at arch/arm/mm/init.c.

lowmem: The memory that has a 1-to-1 mapping between virtual and physical addresses. Both the virtual and the physical addresses are contiguous, and this property makes virtual-to-physical address translation very easy. If we have a virtual address from lowmem, we can find its physical address by applying a constant offset (see __pa() and __va()).

vmalloc: The vmalloc memory is only virtually contiguous.

fixmap/pkmap: create fast mappings of single pages for the kernel. Mostly used by file systems.

modules: The virtual address range for loading and executing kernel modules.
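
For lowmem, the translation mentioned above is a single addition or subtraction. The sketch below shows the arithmetic in user space. It is not the kernel's __pa()/__va(): the sim_ names are hypothetical, SIM_PAGE_OFFSET is the VMSPLIT_3G kernel base, and SIM_PHYS_OFFSET is a made-up DRAM start address that differs from board to board.

```c
#include <stdint.h>

#define SIM_PAGE_OFFSET UINT32_C(0xC0000000)    /* kernel base with VMSPLIT_3G */
#define SIM_PHYS_OFFSET UINT32_C(0x80000000)    /* hypothetical start of DRAM */

/* lowmem virtual address -> physical address (cf. __pa()) */
uint32_t sim_virt_to_phys(uint32_t va)
{
        return va - SIM_PAGE_OFFSET + SIM_PHYS_OFFSET;
}

/* physical address -> lowmem virtual address (cf. __va()) */
uint32_t sim_phys_to_virt(uint32_t pa)
{
        return pa - SIM_PHYS_OFFSET + SIM_PAGE_OFFSET;
}
```

With these constants, the virtual address 0xC0100000 maps to the physical address 0x80100000 and back again. vmalloc addresses cannot be translated this way, because only the virtual side is contiguous there.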

user space

The code for determining the user space virtual address layout is in arch/arm/mm/mmap.c.

User space has two different kinds of mmap layout: legacy and non-legacy. The legacy layout sets the mmap base to TASK_UNMAPPED_BASE and mmap grows bottom-up; the non-legacy layout sets the mmap base near TASK_SIZE - 128MB, with a random shift for security reasons, and mmap grows top-down.

void arch_pick_mmap_layout(struct mm_struct *mm)
{
        unsigned long random_factor = 0UL;

        /* 8 bits of randomness in 20 address space bits */
        if ((current->flags & PF_RANDOMIZE) &&
            !(current->personality & ADDR_NO_RANDOMIZE))
                random_factor = (get_random_int() % (1 << 8)) << PAGE_SHIFT;
        if (mmap_is_legacy()) {
                mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
                mm->get_unmapped_area = arch_get_unmapped_area;
        } else {
                mm->mmap_base = mmap_base(random_factor);
                mm->get_unmapped_area = arch_get_unmapped_area_topdown;
        }
}
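
The function above does not show how mmap_base() turns the random factor into an address. Here is a user-space sketch of one plausible calculation, under stated assumptions: the sim_ names are hypothetical, the gap below TASK_SIZE is the stack size limit clamped between 128MB and five sixths of the task size, and pages are 4K. Check arch/arm/mm/mmap.c in your tree for the real constants.

```c
#define SIM_PAGE_SHIFT 12
#define SIM_PAGE_SIZE  (1UL << SIM_PAGE_SHIFT)
#define SIM_MIN_GAP    (128UL * 1024 * 1024)    /* assumed minimum gap: 128MB */

/* 8 bits of randomness, shifted to page granularity (as in the excerpt). */
unsigned long sim_random_factor(unsigned int random)
{
        return (unsigned long)(random % (1u << 8)) << SIM_PAGE_SHIFT;
}

/*
 * Top-down mmap base: leave a gap for the stack below task_size, subtract
 * the random shift, and round the result up to a page boundary.
 */
unsigned long sim_mmap_base(unsigned long task_size,
                            unsigned long stack_limit,
                            unsigned long rnd)
{
        unsigned long max_gap = task_size / 6 * 5;      /* assumed upper clamp */
        unsigned long gap = stack_limit;

        if (gap < SIM_MIN_GAP)
                gap = SIM_MIN_GAP;
        else if (gap > max_gap)
                gap = max_gap;

        return (task_size - gap - rnd + SIM_PAGE_SIZE - 1) & ~(SIM_PAGE_SIZE - 1);
}
```

With a task size of 0xC0000000, an 8MB stack limit and no randomization, the base is 0xB8000000, which is exactly TASK_SIZE - 128MB. A random factor of 0x34000 lowers it to 0xB7FCC000.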

The user space virtual address layout looks like:

32-bit user virtual address space layout

*ARM has LPAE (Large Physical Address Extension) mode that can address up to 1TB.

Thursday, December 29, 2011

screenrc

startup_message off
termcapinfo xterm ti@:te@
defscrollback 5000
#caption always "%{= kw}%-w%{= BW}%n %t%{-}%+w %-= @%H - %LD %d %LM - %c"
caption always "%{= kw}%-w%{= BW}%n %t%{-}%+w"
screen -t alpha bash
screen -t bravo bash
screen -t charlie bash
screen -t delta bash
screen -t echo bash
screen -t foxtrot bash
screen -t golf bash
screen -t hotel bash
screen -t india bash
screen -t juliet bash
select alpha

vimrc

set nu
syntax on
colorscheme evening
set cul
set hlsearch
set ruler
set ai
set nojoinspaces
set formatoptions=qtcor
"set foldmethod=marker
set history=500

"ctags
"use the ctags for tag search
set tags=$CTAGFILE
"search non-cscope tag first
set csto=1
ab tl Tlist
ab tls TlistSync

"cscope
"use cscope to find caller, callee...
set nobackup
set cscopetag
cs add $CSCOPEFILE
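"the <...> key notation was lost from these mappings; the Ctrl-\ prefix and
"<cword> below are assumptions that follow the stock cscope_maps.vim convention
map <C-\>c :cs find c <C-R>=expand("<cword>")<CR><CR>
map <C-\>d :cs find d <C-R>=expand("<cword>")<CR><CR>
map <C-\>e :cs find e <C-R>=expand("<cword>")<CR><CR>
map <C-\>f :cs find f <C-R>=expand("<cword>")<CR><CR>
map <C-\>g :cs find g <C-R>=expand("<cword>")<CR><CR>
map <C-\>i :cs find i <C-R>=expand("<cword>")<CR><CR>
map <C-\>s :cs find s <C-R>=expand("<cword>")<CR><CR>
map <C-\>t :cs find t <C-R>=expand("<cword>")<CR><CR>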

ab sc set spell
ab nsc set nospell
ab #i #include
ab #d #define
ab #e #endif
ab #p #pragma
ab pn PRINT_NOTICE("start\n");
ab pi PRINT_INFO
ab pe PRINT_ERR
ab pd PRINT_DEBUG
ab #c \////////////////////////////////////////////////////////////////////////////////
\
\
\//////////////////////////////////////////////////////////////////////////////

autocmd BufReadPost *.c set sts=8
autocmd BufReadPost *.c set sw=8
autocmd BufReadPost *.c set expandtab
autocmd BufReadPost *.c set cin
autocmd BufReadPost *.cpp set sts=8
autocmd BufReadPost *.cpp set sw=8
autocmd BufReadPost *.cpp set expandtab
autocmd BufReadPost *.cpp set cin
autocmd BufReadPost *.h set sts=8
autocmd BufReadPost *.h set sw=8
autocmd BufReadPost *.h set expandtab
autocmd BufReadPost *.h set cin
autocmd BufReadPost *.tex set tw=80
autocmd BufReadPost *.tex set ai
autocmd BufReadPost *.sh set sts=8
autocmd BufReadPost *.sh set sw=8
autocmd BufReadPost *.sh set expandtab
autocmd BufReadPost *.sh set cin

bashrc

# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi

# check terminal
if [ "$TERM" = "rxvt-unicode" ]; then
alias vim='TERM=xterm vim'
fi

# set a fancy prompt (non-color, overwrite the one in /etc/profile)
PS1='${debian_chroot:+($debian_chroot)}\u@\h:\$ '

# User specific aliases and functions

alias ls="ls --color -F"
#alias ls="ls"
export P4CONFIG=.p4config
export SVN_EDITOR=vim
export PATH=/sbin:$PATH

#alias expctgp="export CTAGFILE=$HOME/Perforce/tags; cd ~/Perforce/"
#alias expctgw="export CTAGFILE=$HOME/workshop/tags; cd ~/workshop/"
alias expctgl="export CTAGFILE=$HOME/workshop/source/linux-3.1.1/tags; export CSCOPEFILE=$HOME/workshop/source/linux-3.1.1/cscope.out; cd $HOME/workshop/source/linux-3.1.1"

set_ctag() {
echo "export CTAGFILE=`pwd`/$@"
export CTAGFILE=`pwd`/$@
}

set_cscope() {
echo "export CSCOPEFILE=`pwd`/$@"
export CSCOPEFILE=`pwd`/$@
}

# convert a number into decimal format
n2d() {
perl -e "printf \"%d\n\", $@"
}

# convert a number into decimal format (floating number)
n2f() {
perl -e "printf \"%f\n\", $@"
}

# convert a number into hex format
n2h() {
perl -e "printf \"%08X\n\", $@"
}
alias n2x="n2h"

# convert a number into binary format
n2b() {
perl -e "printf \"%08b\n\", $@"
}

# convert a number into mega units (2^20)
n2m() {
perl -e "printf \"%d (M)\n\", $@/1024/1024"
}

# convert binary number into hex format
b2h() {
perl -e "printf \"%02x\n\", 0b$@"
}
alias b2x="b2h"

Thursday, April 21, 2011

converting hex/decimal/binary with bash

Put these functions in ~/.bashrc:

# convert a number into decimal format
n2d() {
perl -e "printf \"%d\n\", $@"
}

# convert a number into hex format
n2h() {
perl -e "printf \"0x%X\n\", $@"
}

# convert a number into binary format
n2b() {
perl -e "printf \"0b%b\n\", $@"
}

Friday, October 29, 2010

set default window size and font in gVim

Click Edit > Startup Settings to edit the vimrc on Windows.

"set default window size
set lines=24
set columns=90
"set default font:size
set guifont=Courier\ New:h14

Monday, September 27, 2010

gen_ctags

# setting
WORKDIR=`pwd`
# get all file list
echo "get all file list"
rm ctags.Uranus ctags.utopia 2>/dev/null
find $WORKDIR/Uranus/ -name '*.[HhCc]' >> ctags.Uranus
find $WORKDIR/Uranus/ -name '*.cc' >> ctags.Uranus
find $WORKDIR/Uranus/ -name '*.cpp' >> ctags.Uranus
find $WORKDIR/utopia/ -name '*.[HhCc]' >> ctags.utopia
find $WORKDIR/utopia/ -name '*.cc' >> ctags.utopia
find $WORKDIR/utopia/ -name '*.cpp' >> ctags.utopia

# remove unwanted files

echo "remove unwanted files"
# Uranus
sed -i /Cus60/d ctags.Uranus
sed -i /eCospro/d ctags.Uranus
sed -i /Trunk/d ctags.Uranus
sed -i /u3/d ctags.Uranus
# utopia
sed -i /u2/d ctags.utopia
sed -i /u3/d ctags.utopia

sed -i /t2/d ctags.utopia
sed -i /t4/d ctags.utopia
sed -i /t7/d ctags.utopia
sed -i /t8/d ctags.utopia
sed -i /t9/d ctags.utopia
sed -i /t11/d ctags.utopia
sed -i /t12/d ctags.utopia
sed -i /t13/d ctags.utopia
sed -i /titania4/d ctags.utopia
sed -i /titania7/d ctags.utopia
sed -i /titania8/d ctags.utopia
sed -i /titania9/d ctags.utopia

sed -i /Janus/d ctags.utopia
sed -i /janus/d ctags.utopia
sed -i /j2/d ctags.utopia

sed -i /s7/d ctags.utopia
sed -i /s7j/d ctags.utopia
sed -i /s7ml/d ctags.utopia
sed -i /s7ld/d ctags.utopia
sed -i /s8/d ctags.utopia

sed -i /maria10/d ctags.utopia
sed -i /prans2/d ctags.utopia

sed -i /r2/d ctags.utopia

# generate tag file
cat ctags.Uranus ctags.utopia > ctags.files
echo "generate tag file"
ctags -L ctags.files
#cscope -b -q -k

# generate file list for Perforce
echo "generate file list for Perforce"
ESCSTR=$(echo "$WORKDIR" | sed -e 's/\//\\\//g')
echo $ESCSTR
echo "sed -e 's/$ESCSTR//g' ctags.files > ctags.p4"
sed -e "s/$ESCSTR//g" ctags.files > ctags.p4

# export global variable
echo "===== IMPORTANT ====="
echo "please export the following variable"
echo "export CTAGFILE=$WORKDIR/tags"

Saturday, September 25, 2010

linux dump stack functions

1) dump_stack

Thursday, September 2, 2010

vimrc

ab tl Tlist
ab tls TlistSync

where

Tlist shows the ctags tag list
TlistSync syncs the tag list window with the current cursor position

Tuesday, June 29, 2010

vimrc

syntax on
set nu
set ruler
set hlsearch
set cul
colorscheme evening
set vb
cs add cscope.out
set cscopetag
set foldmethod=marker
set tw=80
set ai
set nojoinspaces
set formatoptions=qtcor

"for C programming
ab #i #include
ab #d #define
ab #e #endif
ab #p #pragma
ab #m Min-Hua Chen

"spell check
ab sc set spell
ab nsc set nospell
"ab #c Author(s): xxx
"\Copyright (c) 2008 xxx
"\Permission to copy, modify, and distribute this program is granted

ab #c Author(s): xxx

autocmd BufReadPost *.c set cin
autocmd BufReadPost *.h set cin
autocmd BufReadPost *.pl set cin
autocmd BufReadPost *.cpp set cin
autocmd BufReadPost *.php set cin
autocmd BufReadPost *.py set si
autocmd BufReadPost *.java set si
autocmd BufReadPost *.java set cino+=j1
autocmd BufReadPost *.java set sw=4

Wednesday, March 31, 2010

X forwarding in ssh

To enable X forwarding, try the following command
ssh -X xxx.xxx.xxx.xxx

Then we can launch a GUI program through ssh.

If it does not work as expected, check the DISPLAY environment variable.

export DISPLAY=:10.0
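
A minimal sketch for checking DISPLAY inside the ssh session before overriding it by hand. check_display is a hypothetical helper name, not part of ssh or X.

```shell
# check_display: hypothetical helper that reports whether DISPLAY is set.
# Returns 0 and prints the value if it is set, returns 1 otherwise.
check_display() {
if [ -z "${DISPLAY:-}" ]; then
echo "DISPLAY is not set"
return 1
fi
echo "DISPLAY=$DISPLAY"
}

# Report the current state; "|| true" keeps a script running under "set -e".
check_display || true
```

If DISPLAY is empty after "ssh -X", the forwarding itself was probably not set up, and exporting a value by hand will only help if an X display really listens there.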

Wednesday, March 17, 2010

Remove VirtualBox lock

The lock file of VirtualBox is placed in
/tmp/.vbox-[user name]-ipc

Sunday, December 13, 2009

double fork to avoid zombie process

It is a common mistake to fork a child process without calling waitpid() to wait for its termination. Without a wait() call, the child becomes a zombie process after it terminates, because its parent has not cleaned up its entry in the process table. A zombie still occupies a pid, which reduces the number of pids available in the system. Zombies are marked as "defunct" in the output of the "ps" command.

However, sometimes we do not want the parent process to wait a long time for its child. There is a way to both avoid creating a zombie and avoid waiting for the child to terminate: the double fork.

The idea is simple. When a parent process (say A) wants a child process to do "something", it does not fork that process directly. Process A first forks a child process (say B), process B then forks its own child process (say C) to do "something", and process B terminates as soon as process C is created. In this way, process A only has to wait for process B for a short time. At the same time, since process C no longer has a parent (process B is dead), the system re-parents process C to the init process. The init process calls wait() for its children, which solves the zombie problem.

The program looks like


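#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

void func(void)
{
     pid_t pid1;
     pid_t pid2;
     int status;
     char *argv[] = { "something", NULL };

     pid1 = fork();
     if (pid1 > 0) {
             /* parent process A: B exits quickly, so this wait is short */
             waitpid(pid1, &status, 0);
     } else if (pid1 == 0) {
             /* child process B */
             pid2 = fork();
             if (pid2 > 0) {
                     /* B exits at once; C is re-parented to init */
                     _exit(0);
             } else if (pid2 == 0) {
                     /* child process C */
                     execvp(argv[0], argv);
                     _exit(127); /* reached only if execvp fails */
             } else {
                     /* error: second fork failed */
                     _exit(1);
             }
     } else {
             /* error: first fork failed */
     }
}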
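
The same idea can be sketched in shell, as an analogue of the C program rather than a replacement for it. The subshell plays the role of process B and the background command plays the role of process C. detach is a hypothetical helper name.

```shell
# detach: hypothetical helper that runs a command via a shell-level double fork.
# "( ... )" forks a subshell (process B); "&" forks the command (process C).
# The subshell exits immediately, so the calling shell (process A) waits only
# briefly, and the orphaned command is re-parented by the system.
detach() {
( "$@" & )
}
```

For example, "detach sleep 30" returns at once, and "ps -o pid,ppid,comm" should then show the sleep process with a parent other than the calling shell (usually init, or a subreaper on some systems).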

Monday, December 7, 2009

kill all child processes

killtree () {
for child in $(ps -o pid= --ppid $1)
do
killtree $child
done
echo "kill -9 $1"
kill -9 $1 2>/dev/null
}
killtree {some pid}

read line in bash script

It is very easy to read "words" in a bash script. But what if you want to read a text file line by line? I had this problem today. After some googling, I found an interesting way to do this.

# IFS= keeps leading/trailing whitespace, -r keeps backslashes as they are
while IFS= read -r l
do
echo "$l"
done < xxx.txt

It uses xxx.txt as input, and bash's read command reads the input line by line.
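
A slightly more defensive sketch of the same loop. print_numbered is a hypothetical helper name. It assumes we also want to keep leading whitespace and to process a final line that has no trailing newline, which a plain "while read" loop would skip.

```shell
# print_numbered FILE: hypothetical helper that prints each line of FILE
# prefixed with its line number.
print_numbered() {
n=0
# IFS= preserves leading/trailing whitespace, -r preserves backslashes.
# The "|| [ -n "$line" ]" part handles a last line without a newline.
while IFS= read -r line || [ -n "$line" ]; do
n=$((n + 1))
printf '%d: %s\n' "$n" "$line"
done < "$1"
}
```

Call it as "print_numbered xxx.txt".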

linux 802.1q support

In the previous article "vlan tag under vmware/virtualbox", I wrote that the e1000 network driver cannot insert vlan tags into frames. In this article I want to tell you that I WAS WRONG.

After some experiments, I found that the e1000 driver does insert the vlan tag into frames. The problem is that the e1000 driver automatically removes vlan tags from incoming frames before tcpdump gets to read them.

So we can now use linux + the 802.1q module to communicate with vlan members.