Friday, April 8, 2022

[SOLVED] C++ with OpenMP: trying to avoid false sharing in a tight loop over an array

Issue

I am trying to introduce OpenMP into my C++ code to improve performance, using the simple test case shown below:

#include <omp.h>
#include <chrono>
#include <iostream>
#include <cmath>

using std::cout;
using std::endl;

#define NUM 100000

int main()
{
    double data[NUM] __attribute__ ((aligned (128)));

    #ifdef _OPENMP
        auto t1 = omp_get_wtime();
    #else
        auto t1 = std::chrono::steady_clock::now();
    #endif

    for(long int k=0; k<100000; ++k)
    {

        #pragma omp parallel for schedule(static, 16) num_threads(4)
        for(long int i=0; i<NUM; ++i)
        {
            data[i] = cos(sin(i*i+ k*k));
        }
    }

    #ifdef _OPENMP
        auto t2 = omp_get_wtime();
        auto duration = t2 - t1;
        cout<<"OpenMP Elapsed time (second): "<<duration<<endl;
    #else
        auto t2 = std::chrono::steady_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
        cout<<"No OpenMP Elapsed time (second): "<<duration/1e6<<endl;
    #endif

    double tempsum = 0.;
    for(long int i=0; i<NUM; ++i)
    {
        int nextind = (i == 0 ? 0 : i-1);
        tempsum += i + sin(data[i]) + cos(data[nextind]);
    }
    cout<<"Raw data sum: "<<tempsum<<endl;
    return 0;    
}

The code repeatedly overwrites every element of a double array (size = 100000) in a tight loop, either in parallel or serially.

Build as

g++ -o test test.cpp 

or

g++ -o test test.cpp -fopenmp

The program reported the following results:

No OpenMP Elapsed time (second): 427.44
Raw data sum: 5.00009e+09

OpenMP Elapsed time (second): 113.017
Raw data sum: 5.00009e+09

Environment: 10th-generation Intel CPU, Ubuntu 18.04, GCC 7.5, OpenMP 4.5.

I suspect that false sharing of cache lines is what hurts the performance of the OpenMP version.
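
One way to sanity-check that suspicion is to work out which cache lines each thread writes under schedule(static, 16). The sketch below is only an illustration: the function name static_chunks_share_line is hypothetical, and it assumes the array starts on a cache-line boundary, that an element never straddles two lines, and that chunks are handed to threads round-robin in thread order. The 64-byte line size used in the note after the code is also an assumption (it is typical for x86 CPUs, but not stated in the post).

```cpp
#include <cstddef>

// Hypothetical helper: returns true if, under schedule(static, chunk), at
// least one cache line holds elements written by two different threads.
// Assumes the array begins on a cache-line boundary and that elem_size
// divides line_size, so no element straddles two lines.
bool static_chunks_share_line(std::size_t n, std::size_t chunk,
                              std::size_t elem_size, std::size_t line_size,
                              std::size_t threads)
{
    // Elements in one cache line are contiguous, so it is enough to look at
    // neighbouring elements: same line but different owner means sharing.
    for (std::size_t i = 0; i + 1 < n; ++i)
    {
        const std::size_t line_a = (i * elem_size) / line_size;
        const std::size_t line_b = ((i + 1) * elem_size) / line_size;
        const std::size_t thread_a = (i / chunk) % threads;
        const std::size_t thread_b = ((i + 1) / chunk) % threads;
        if (line_a == line_b && thread_a != thread_b)
            return true;
    }
    return false;
}
```

Calling static_chunks_share_line(NUM, 16, sizeof(double), 64, 4) with the values from the code above returns false: 16 doubles of 8 bytes fill exactly 128 bytes, so with a 128-byte-aligned array each chunk covers whole 64-byte lines. A chunk size of 1 would return true. Under these assumptions, false sharing may not be the main cost here.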

Update: I have replaced the test results above with new ones obtained after increasing the loop size; the OpenMP version now runs faster, as expected.

Thank you!


Solution

  1. Since you're writing C++, use the C++ <random> generators rather than the legacy C one you're using: each thread can own its own engine, whereas the C generator is not guaranteed to be thread-safe.
  2. Also, if the data array is never read after the timed loop, the compiler is at liberty to remove the loop completely, so make sure the result is actually used.
  3. You should touch all your data once before you do the timed loop. That way you ensure that the pages are instantiated and that the data is in cache, or out of it, depending on the array size.
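  4. Your loop is pretty short, so the overhead of starting and synchronizing the threads on every outer iteration can be significant compared with the work done inside it.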
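
Points 2 to 4 can be sketched roughly as follows. This is a minimal illustration, not the answerer's code: the names timed_fill and checksum are hypothetical, and opening the parallel region once around the outer loop is just one possible way to reduce the per-iteration overhead, which the answer itself does not prescribe. The pragmas are ignored when the file is built without -fopenmp, so the same source runs serially too.

```cpp
#include <chrono>
#include <cmath>
#include <vector>

// Hypothetical helper: overwrites `data` `repeats` times and returns the
// elapsed wall-clock time in seconds.
double timed_fill(std::vector<double>& data, long repeats)
{
    const long n = static_cast<long>(data.size());

    // Point 3: touch every element once, outside the timed section, so the
    // pages are already instantiated when the measurement starts.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        data[i] = 0.0;

    const auto t1 = std::chrono::steady_clock::now();

    // Point 4 (assumption): create the thread team once, then share each
    // inner loop among the existing threads instead of re-entering a
    // parallel region on every k.
    #pragma omp parallel
    for (long k = 0; k < repeats; ++k)
    {
        #pragma omp for schedule(static)
        for (long i = 0; i < n; ++i)
            data[i] = std::cos(std::sin(static_cast<double>(i) * i +
                                        static_cast<double>(k) * k));
        // The implicit barrier at the end of "omp for" keeps threads in step.
    }

    const auto t2 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t2 - t1).count();
}

// Point 2: read the array after the timed loop and use the result, so the
// compiler cannot treat the stores as dead code.
double checksum(const std::vector<double>& data)
{
    double sum = 0.0;
    for (double x : data)
        sum += x;
    return sum;
}
```

A caller would allocate std::vector<double> data(NUM), print the value returned by timed_fill(data, 100000), and then print checksum(data) so that the result is observably used.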


Answered By - Victor Eijkhout
Answer Checked By - David Marino (WPSolving Volunteer)