c++ bool - Is it necessary to use a std::atomic to signal that a thread has finished execution?




I would like to check whether a std::thread has finished execution. While searching, I found the following question, which addresses this issue. The accepted answer proposes having the worker thread set a variable right before exiting and having the main thread check this variable. Here is a minimal working example of such a solution:

#include <unistd.h>
#include <thread>

void work( bool* signal_finished ) {
  sleep( 5 );
  *signal_finished = true;
}

int main()
{
  bool thread_finished = false;
  std::thread worker(work, &thread_finished);

  while ( !thread_finished ) {
    // do some own work until the thread has finished ...
  }

  worker.join();
}

Someone who commented on the accepted answer claims that one cannot use a simple bool variable as a signal, the code was broken without a memory barrier and using std::atomic<bool> would be correct. My initial guess is that this is wrong and a simple bool is sufficient, but I want to make sure I'm not missing something. Does the above code need a std::atomic<bool> in order to be correct?

Let's assume the main thread and the worker are running on different CPUs in different sockets. What I think would happen is that the main thread reads thread_finished from its CPU's cache. When the worker updates it, the cache coherency protocol takes care of writing the worker's change to global memory and invalidating the main thread's CPU's cache, so it has to read the updated value from global memory. Isn't the whole point of cache coherence to make code like the above just work?


Answers

Cache coherency algorithms are not present everywhere, nor are they perfect. The issue with thread_finished is that one thread writes to it while another thread reads it, with no synchronization between the two accesses. This is a data race, and because neither access happens before the other, the behavior is undefined.


Using a raw bool is not sufficient.

The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior. § 1.10 p21

Two expression evaluations conflict if one of them modifies a memory location (1.7) and the other one accesses or modifies the same memory location. § 1.10 p4

Your program contains a data race where the worker thread writes to the bool and the main thread reads from it, but there is no formal happens-before relation between the operations.

There are a number of different ways to avoid the data race, including using std::atomic<bool> with appropriate memory orderings, using a memory barrier, or replacing the bool with a condition variable.
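As a rough illustration of the first option, here is a minimal sketch of the question's program with the plain bool replaced by std::atomic<bool>. The helper name poll_until_finished and the 50 ms sleep are assumptions made for the example; they are not part of the original code.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Worker: simulate some work, then signal completion through the atomic flag.
void work(std::atomic<bool>& signal_finished) {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    signal_finished.store(true);  // default ordering is memory_order_seq_cst
}

// Hypothetical helper: polls the flag the way main() does in the question
// and returns how many times the loop body ran before the flag was seen.
long poll_until_finished() {
    std::atomic<bool> thread_finished{false};
    std::thread worker(work, std::ref(thread_finished));

    long iterations = 0;
    while (!thread_finished.load()) {
        ++iterations;               // do some own work until the thread has finished ...
        std::this_thread::yield();  // be polite to other threads while polling
    }

    worker.join();
    assert(thread_finished.load()); // the flag stays set once the worker stored it
    return iterations;
}
```

Because both the store and the load are atomic, the two accesses no longer form a data race, and the compiler may not assume the flag is unchanged across loop iterations.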


It's not OK. The optimizer can optimize

  while ( !thread_finished ) {
    // do some own work until the thread has finished ...
  }

to:

  if(!thread_finished)
    while (1) {
      // do some own work until the thread has finished ...
    }

assuming it can prove that "some own work" doesn't change thread_finished.


"Someone who commented on the accepted answer claims that one cannot use a simple bool variable as a signal, the code was broken without a memory barrier and using std::atomic<bool> would be correct."

The commenter is right: a simple bool is insufficient, because non-atomic writes from the thread that sets thread_finished to true can be re-ordered.

Consider a thread that sets a static variable x to some very important number, and then signals its exit, like this:

x = 42;
thread_finished = true;

When your main thread sees thread_finished set to true, it assumes that the worker thread has finished. However, when your main thread examines x, it may find it set to a wrong number, because the two writes above have been re-ordered.

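Of course, this is only a simplified example to illustrate the general problem. Using std::atomic for your thread_finished variable adds the necessary memory barriers: with the default ordering, the store to the flag is a release operation and the load is an acquire operation, so every write made before the store is visible to a thread that has observed the flag as true. This fixes the potential problem of out-of-order writes.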
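The x = 42 scenario can be sketched with explicit release/acquire orderings as follows. The function names worker_body and read_x_after_finish are made up for this illustration, and the explicit memory_order arguments are one possible choice; the default sequentially consistent ordering would work as well.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int x = 0;                                  // ordinary, non-atomic payload
std::atomic<bool> thread_finished{false};   // completion flag

// Worker: write the payload first, then publish it by setting the flag.
void worker_body() {
    x = 42;                                                  // (1) plain write
    thread_finished.store(true, std::memory_order_release);  // (2) cannot be reordered before (1)
}

// Hypothetical helper: starts the worker, waits for the flag, then reads x.
int read_x_after_finish() {
    x = 0;
    thread_finished.store(false);
    std::thread worker(worker_body);

    // The acquire load that sees `true` synchronizes with the release store,
    // so the write to x happens before the read below.
    while (!thread_finished.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    int seen = x;

    worker.join();
    return seen;
}
```

With a plain bool in place of the atomic flag, nothing would order (1) before (2) from the main thread's point of view, and the read of x would itself be part of a data race.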

Another issue is that reads of non-volatile, non-atomic variables can be optimized out, so the main thread might never notice the change in the thread_finished flag.


Important note: making your thread_finished volatile is not going to fix the problem; in fact, volatile should not be used in conjunction with threading - it is intended for working with memory-mapped hardware.


Below is an example C implementation of Clay S. Turner's fixed-point log base 2 algorithm[1]. The algorithm doesn't require any kind of look-up table. This can be useful on systems where memory constraints are tight and the processor lacks an FPU, such as is the case with many microcontrollers. Log base e and log base 10 are then also supported by using the property of logarithms that, for any base n:

    log_n(x) = log_y(x) / log_y(n)

where, for this algorithm, y equals 2.
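To make the change of base concrete, the sketch below checks in floating point that the two Q1.31 constants used in the listing correspond to 1/log2(e) and 1/log2(10). The constant values are copied from the listing; the helper names q1dot31_to_double, ln_via_log2, and log10_via_log2 are assumptions introduced only for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Q1.31 constants copied from the listing: the value is the integer divided by 2^31.
constexpr std::uint64_t INV_LOG2_E_Q1DOT31  = 0x58b90bfcULL;  // about 1 / log2(e)
constexpr std::uint64_t INV_LOG2_10_Q1DOT31 = 0x268826a1ULL;  // about 1 / log2(10)

// Hypothetical helper: interpret a Q1.31 integer as a real number.
double q1dot31_to_double(std::uint64_t q) {
    return static_cast<double>(q) / 2147483648.0;  // divide by 2^31
}

// log_e(x) = log_2(x) / log_2(e), written as a multiplication by 1 / log_2(e).
double ln_via_log2(double x) {
    return std::log2(x) * q1dot31_to_double(INV_LOG2_E_Q1DOT31);
}

// log_10(x) = log_2(x) / log_2(10), written as a multiplication by 1 / log_2(10).
double log10_via_log2(double x) {
    return std::log2(x) * q1dot31_to_double(INV_LOG2_10_Q1DOT31);
}
```

Multiplying by a precomputed inverse instead of dividing is what lets the fixed-point code get by with a single 64-bit multiply and a shift.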

A nice feature of this implementation is that it supports variable precision: the precision can be determined at runtime, at the expense of range. The way I've implemented it, the processor (or compiler) must be capable of doing 64-bit math for holding some intermediate results. It can be easily adapted to not require 64-bit support, but the range will be reduced.

When using these functions, x is expected to be a fixed-point value scaled according to the specified precision. For instance, if precision is 16, then x should be scaled by 2^16 (65536). The result is a fixed-point value with the same scale factor as the input. A return value of INT32_MIN represents negative infinity. A return value of INT32_MAX indicates an error and errno will be set to EINVAL, indicating that the input precision was invalid.
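The scaling convention can be illustrated with a pair of conversion helpers. These helpers, to_fixed and from_fixed, are hypothetical and not part of the implementation described here; they only show what "scaled by 2^precision" means for inputs and results.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: scale a non-negative real value by 2^precision and
// round to the nearest integer, giving the fixed-point input format.
std::uint32_t to_fixed(double value, std::size_t precision) {
    return static_cast<std::uint32_t>(
        std::llround(std::ldexp(value, static_cast<int>(precision))));
}

// Hypothetical helper: undo the scaling on a (possibly negative) result.
double from_fixed(std::int32_t fixed, std::size_t precision) {
    return std::ldexp(static_cast<double>(fixed), -static_cast<int>(precision));
}
```

With a precision of 16, the value 8.0 would be passed in as 524288 (8 * 65536), and a result of 196608 would stand for 3.0, which is log2(8).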

#include <errno.h>
#include <stddef.h>

#include "log2fix.h"

#define INV_LOG2_E_Q1DOT31  UINT64_C(0x58b90bfc) // Inverse log base 2 of e
#define INV_LOG2_10_Q1DOT31 UINT64_C(0x268826a1) // Inverse log base 2 of 10

int32_t log2fix (uint32_t x, size_t precision)
{
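    int32_t y = 0;

    if (precision < 1 || precision > 31) {
        errno = EINVAL;
        return INT32_MAX; // indicates an error
    }

    // Compute b only after precision has been validated; shifting by
    // (precision - 1) is undefined when precision is 0.
    int32_t b = 1U << (precision - 1);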

    if (x == 0) {
        return INT32_MIN; // represents negative infinity
    }

    while (x < 1U << precision) {
        x <<= 1;
        y -= 1U << precision;
    }

    while (x >= 2U << precision) {
        x >>= 1;
        y += 1U << precision;
    }

    uint64_t z = x;

    for (size_t i = 0; i < precision; i++) {
        z = z * z >> precision;
        if (z >= 2U << precision) {
            z >>= 1;
            y += b;
        }
        b >>= 1;
    }

    return y;
}

int32_t logfix (uint32_t x, size_t precision)
{
    uint64_t t;

    t = log2fix(x, precision) * INV_LOG2_E_Q1DOT31;

    return t >> 31;
}

int32_t log10fix (uint32_t x, size_t precision)
{
    uint64_t t;

    t = log2fix(x, precision) * INV_LOG2_10_Q1DOT31;

    return t >> 31;
}

The code for this implementation also lives on GitHub, along with a sample/test program that illustrates how to use these functions to compute and display logarithms of numbers read from standard input.

[1] C. S. Turner, "A Fast Binary Logarithm Algorithm", IEEE Signal Processing Mag., pp. 124,140, Sep. 2010.





c++ c++11 stdthread stdatomic