Wednesday, November 11, 2015

Uninitialized data problem in the LZMA compressor used by TokuFT

The LZMA algorithm implemented by the xz compression software package is one of the compression algorithms used by TokuDB for MySQL and TokuMX for MongoDB.  Unfortunately, valgrind's memcheck reports an uninitialized variable problem in the xz software encoder when running applications, including this simple test case.  The problem is benign with respect to correct functionality.  However, the problem is interesting with respect to how the gcc compiler generates code for logical AND operations in C language programs.   The summary of the problem is that valgrind reports a conditional jump using uninitialized data because the xz software encoder did not initialize all of its variables and gcc reordered the evaluation of the operands in a logical AND operator.

The problem in lz_encoder.c line 226 can be reduced to the following code.  The LZMA encoder uses a match finder data structure when compressing data.  When a match finder structure is allocated, the match finder constructor initializes 'mf->buffer' variable but does not initialize 'mf->size' variable.  So, these two variables have the following values after construction.

    mf->buffer = NULL
    mf->size is not initialized so its value is undefined

Later, when the LZMA encoder is used, it sometimes needs to allocate a new 'mf->buffer'.  It first reads the size of the buffer.

    old_size = mf->size

After this, 'old_size' is undefined since 'mf->size' is undefined.  A new size for the buffer is computed using some other values which happen to be initialized.

    mf->size = compute new size not using the old_size variable

Valgrind's memcheck reports the conditional jump using uninitialized data error when executing the following if statement:

    if (mf->buffer != NULL && old_size != mf->size) {
        do something
    }

The gcc compiler generates code that evaluates the 'old_size != mf->size' expression before the 'mf->buffer != NULL' expression.  This was verified by examining the instructions generated by the compiler.  Since the 'old_size' variable has an undefined value, the result of the comparison is undefined, and memcheck reports it.  Luckily, since 'mf->buffer != NULL' is false, the if statement condition is false regardless of the undefined 'old_size' value.

The C language definition says operands for a logical AND operator are evaluated left to right.  See section 6.5.13.  In addition, if the leftmost operand evaluates to 0, then the operands to its right are not evaluated.  This is known as short circuit evaluation.  The C languages definition says that a sequence point exists between the left and right operands in a logical AND operator.  See section 5.1.2.3.  The sequence point implies that all values and side effects of the left operand are computed before all values and side effects of the right operand.  Is the result of the evaluation of an operand a side effect?

The gcc compiler seems to be violating the left to right execution order.  Is it OK?  Since there are no side effects in the 'mf->buffer != NULL' and 'old_size != mf->size' operands, then the evaluation order does not matter.  Since the execution order is only switched when using aggressive compiler optimization, the compiler probably decided that switching the evaluation order could lead to faster execution.

Computing with undefined values is generally dangerous since the program may compute the wrong result.  Luckily, this is not the case in the xz encoder since the undefined value results in the same top level behavior.  However, if one wants to use a memory checker like valgrind's  memcheck or the memory sanitizer to verify a program, one should fix bugs related to undefined values.  The fix for the undefined data in the xz encoder has been committed to the xz github repository.  As you can see, it merely initializes an additional variable in the match finder constructor.

Versions

Ubuntu 14.04
gcc 4.9.3
xz util v5.2.2