Friday, August 2, 2013

How to fast initialize a really big array with 16, 32, 64 bit asf values

Sometimes we need a version of memset which sets a value that is larger than one byte. memset only uses one byte of the value passed in and does byte-wise initialization. If you want to initialize an array with a particular value larger than one byte, just use these approaches:

1. Just use a simple for ( ; ; )  In most cases this is a simple and good way.

2. std::fill, std::fill_n

#define BUFF_SIZE 1024 * 1024 * 1536
...

uint16_t val16 = 10000;
uint32_t val32 = 10000 * 10000L;
uint64_t val64 = 10000 * 10000 * 10000LL;
void* buff = new uint8_t[BUFF_SIZE];
...
std::fill_n((uint16_t*)buff, BUFF_SIZE / sizeof(val16), val16);
std::fill_n((uint32_t*)buff, BUFF_SIZE / sizeof(val32), val32);
std::fill_n((uint64_t*)buff, BUFF_SIZE / sizeof(val64), val64);


3. The memset function is very optimized in the compilers. Some implementations offer a 16-bit version memsetw, but that's not everywhere. The memcpy implementations are often written with SIMD instructions which makes it possible to shuffle 128 bits at a time. SIMD instructions are assembly instructions that can perform the same operation on each element in a vector up to 16 bytes long. That includes load and store instructions. So, here is a memsetx function which is implemented via memcpy:

template <typename T>
void memsetx(void* dst, T& val, unsigned int size)
{
    uint32_t i = 0;

    for ( ; i < (size & (~(sizeof(T) - 1))); i += sizeof(T))
        memcpy((uint8_t*)dst + i, &val, sizeof(T));

    for ( ; i < size; ++i)
        ((uint8_t*)dst)[i] = ((uint8_t*)&val)[i & (sizeof(T) - 1)];
}
...
memsetx(buff, val16, BUFF_SIZE);
memsetx(buff, val32, BUFF_SIZE);
memsetx(buff, val64, BUFF_SIZE);
...

memsetx(buffval1024, BUFF_SIZE);
...

4 comments:

  1. Here is an idea for memsetx. Initialize 1 element, copy the value to 2nd, copy 2 values to 3rd and 4th, copy 4 values to 5-8 and so on. So instead of O(n) for initialization it is O(ln n) now.

    ReplyDelete
    Replies
    1. Thanks, nice idea, here is the implementation:

      template
      void memsetx(void* dst, const T& val, unsigned int size)
      {
      auto i = sizeof val;
      auto p = (int8_t*)dst;

      memcpy(p, &val, i);

      for ( ; i <= (size >> 1); i <<= 1)
      memcpy(p + i, p, i);

      auto n = (size - i) & (~(sizeof val - 1));
      memcpy(p + i, p + i - n, n);

      for ( i += n; i < size; ++i)
      p[i] = ((int8_t*)&val)[i & (sizeof val - 1)];
      }

      I am also thinking to write benchmarks for all ways of memory initialization to check what is better and how to optimize more.

      Delete
    2. For the rest of array you can just memcpy for all uninitialized elements with 1 operation. Also why do you need that last loop? One for with 1 memcpy and one additional memcpy should be enough.

      Delete
    3. Yes, you are right, it is redundant. The last for loop was copied from the first version, and it was needed for example when sizeof(val)==8 and size==10.

      Delete