Wednesday, November 11, 2015

Uninitialized data problem in the LZMA compressor used by TokuFT

The LZMA algorithm implemented by the xz compression software package is one of the compression algorithms used by TokuDB for MySQL and TokuMX for MongoDB.  Unfortunately, valgrind's memcheck reports an uninitialized variable problem in the xz software encoder when running applications, including this simple test case.  The problem is benign with respect to correctness.  However, it is interesting with respect to how the gcc compiler generates code for logical AND operations in C programs.  In short, valgrind reports a conditional jump using uninitialized data because the xz encoder does not initialize all of its variables and gcc reorders the evaluation of the operands of a logical AND operator.

The problem, at lz_encoder.c line 226, can be reduced to the following code.  The LZMA encoder uses a match finder data structure when compressing data.  When a match finder structure is allocated, the match finder constructor initializes the 'mf->buffer' variable but does not initialize the 'mf->size' variable.  So, these two variables have the following values after construction.

    mf->buffer = NULL
    mf->size is not initialized so its value is undefined

Later, when the LZMA encoder is used, it sometimes needs to allocate a new 'mf->buffer'.  It first reads the old size of the buffer.

    old_size = mf->size

After this, 'old_size' is undefined since 'mf->size' is undefined.  A new size for the buffer is computed using some other values which happen to be initialized.

    mf->size = compute new size not using the old_size variable

Valgrind's memcheck reports a 'conditional jump or move depends on uninitialised value(s)' error when executing the following if statement:

    if (mf->buffer != NULL && old_size != mf->size) {
        do something
    }

The gcc compiler generates code that evaluates the 'old_size != mf->size' expression before the 'mf->buffer != NULL' expression.  This was verified by examining the instructions generated by the compiler.  Since the 'old_size' variable has an undefined value, the result of the comparison is undefined, and memcheck reports it.  Luckily, since 'mf->buffer != NULL' is false, the if statement condition is false regardless of the undefined 'old_size' value.
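
Here is a minimal, self-contained sketch of the pattern (not the actual xz code; the struct and function names below are made up for illustration).  Whether memcheck actually fires depends on the compiler version and optimization level, since the report only appears when the compiler evaluates the right operand first.

    /* repro.c - sketch of the lz_encoder.c pattern, not the real code.
     * Build and test with something like:
     *   gcc -O2 -g repro.c -o repro
     *   valgrind ./repro
     */
    #include <stdio.h>
    #include <stdlib.h>

    struct match_finder {
        unsigned char *buffer;
        size_t size;
    };

    static struct match_finder *match_finder_new(void) {
        struct match_finder *mf = malloc(sizeof *mf);
        mf->buffer = NULL;   /* initialized by the constructor */
        /* mf->size is deliberately left uninitialized, as in lz_encoder.c */
        return mf;
    }

    int main(void) {
        struct match_finder *mf = match_finder_new();

        size_t old_size = mf->size;   /* copies an undefined value */
        mf->size = 4096;              /* new size, computed without old_size */

        /* If the compiler evaluates 'old_size != mf->size' before
         * 'mf->buffer != NULL', the branch depends on undefined data
         * and memcheck reports a conditional jump error here. */
        if (mf->buffer != NULL && old_size != mf->size)
            puts("reallocate the buffer");

        free(mf);
        return 0;
    }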

The C language definition says the operands of a logical AND operator are evaluated left to right.  See section 6.5.13.  In addition, if the leftmost operand evaluates to 0, then the operands to its right are not evaluated.  This is known as short circuit evaluation.  The C language definition also says that a sequence point exists between the left and right operands of a logical AND operator.  See section 5.1.2.3.  The sequence point implies that all value computations and side effects of the left operand are complete before those of the right operand.  Is the result of the evaluation of an operand a side effect?
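
Short circuit evaluation is what makes the common null pointer guard idiom safe.  For example (a hypothetical helper, written for illustration):

    #include <stddef.h>

    struct match_finder { unsigned char *buffer; size_t size; };

    /* Relies on left to right, short circuit evaluation: when 'mf' is
     * NULL the left operand is 0, so 'mf->size' is never evaluated and
     * no null pointer is dereferenced. */
    static int mf_has_data(const struct match_finder *mf) {
        return mf != NULL && mf->size > 0;
    }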

The gcc compiler seems to be violating the left to right evaluation order.  Is that OK?  Since there are no side effects in the 'mf->buffer != NULL' and 'old_size != mf->size' operands, the evaluation order does not matter to the result.  Since the order is only switched when using aggressive compiler optimization, the compiler presumably decided that switching the evaluation order leads to faster code.

Computing with undefined values is generally dangerous since the program may compute the wrong result.  Luckily, this is not the case in the xz encoder since the undefined value results in the same top level behavior.  However, if one wants to use a memory checker like valgrind's memcheck or the memory sanitizer to verify a program, one should fix bugs related to undefined values.  The fix for the undefined data in the xz encoder has been committed to the xz github repository.  As you can see, it merely initializes an additional variable in the match finder constructor.
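
Paraphrased (the committed patch may differ in detail), the fix amounts to one extra assignment in the constructor:

    /* In the match finder constructor: initialize mf->size alongside
     * mf->buffer so that 'old_size = mf->size' later always reads a
     * defined value.  (Paraphrase of the fix, not the exact diff.) */
    mf->buffer = NULL;
    mf->size = 0;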

Versions

Ubuntu 14.04
gcc 4.9.3
xz utils v5.2.2


3 comments:

  1. In this case, it's a legal compiler optimization. Sequence points provide the semantics for a serial execution of a C program, but they don't constrain the optimizer; as long as the resulting code has the same semantics, it's OK. For example, C compilers routinely reorder statements, and yet there is a sequence point between every statement.

    In some situations this optimization would be incorrect. I cannot always reorder this expression

        bool f(bool *a, int i, int j) {
            return a[i] && a[j];
        }

    because it's possible that evaluating a[j] *does* have a side effect. (What if &a[j] is an invalid address?)

    In the case of the lz_encoder, it's safe to reorder the expressions since the original code both reads and writes mf->size, so if reading mf->size causes a segfault, the reordered code still does. I think the moral is that it's very difficult to write these optimizations correctly, which is one reason it's better left to a compiler.

    I suspect that someday the compiler will optimize your patch away, and that there will be no way to run valgrind on lzma compiled at a sufficiently high optimization level.

  2. The patch is to initialize a variable before using it, not reorder the operands in a logical AND operation. I don't see how an optimizer will generate code that violates that order.

  3. The original LZMA encoder is correct even though some of its variables sometimes have undefined values. Valgrind is reporting a false positive because the gcc optimizer decided to rearrange the evaluation of the operands in an if statement. In this particular case, the evaluation order does not matter to the C program. Since valgrind does not understand the higher level C program because it is an instruction emulator, it must report a false positive.

    We want to continue using tools like valgrind since it can find bugs like memory leaks and uses of uninitialized variables that are common in C programs. What should we do?

    We could add a valgrind suppression for this false positive. IMO, a suppression can be lost inside of a large function, which means that we have given up on that function.

    We could only use valgrind on code that has not been heavily optimized by the compiler. There are probably no guarantees that this will always work, as Bradley notes.

    We could change the code to initialize its variables before they are read and used. This is what the patch does.

    Or, we could use a different technology like the address and memory sanitizers that insert verification code into a program at compile time. I suspect that this is more likely to work.
