Issue
I have code that looks like this:
void SimulationStep(float *In, float *Out, float L, int N)
{
    Out[0] = In[0] - 2.0f*L*In[0] + L*In[N-1] + L*In[1];
    for (int x = 1; x < N-1; x++)
    {
        Out[x] = In[x] - 2.0f*L*In[x] + L*In[x-1] + L*In[x+1];
    }
    Out[N-1] = In[N-1] - 2.0f*L*In[N-1] + L*In[N-2] + L*In[0];
}
I am trying to parallelize it. I've tried many approaches; this is one example:
void SimulationStep(float *In, float *Out, float L, int N)
{
    Out[0] = In[0] - 2.0f*L*In[0] + L*In[N-1] + L*In[1];
    #pragma omp parallel for
    for (int x = 1; x < N-1; x++)
    {
        Out[x] = In[x] - 2.0f*L*In[x] + L*In[x-1] + L*In[x+1];
    }
    Out[N-1] = In[N-1] - 2.0f*L*In[N-1] + L*In[N-2] + L*In[0];
}
The changes I've applied only gain 0.5 seconds, from 14s to 13.5s, so I suspect the code is not actually being parallelized. I think it might be a memory-bound bottleneck, so I don't know what to do. Thank you in advance. PS: I'm compiling with gcc/9.2.0 with -O3 -fopenmp.
Solution
@Black_Alistar is alluding to Amdahl's Law: the achievable speed-up is limited by the proportion of the program's runtime that can be parallelized. It looks like your application is spending most of its time outside the simulation loop.
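To make that concrete, here is a minimal sketch of Amdahl's Law. The fractions p below are made-up numbers purely for illustration, not measurements of your program:

#include <stdio.h>

/* Amdahl's Law: with a fraction p of the runtime parallelizable and n
 * threads, the best possible speed-up is 1 / ((1 - p) + p / n).
 * The values of p used in main() are hypothetical. */
static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    /* If only 5% of a 14 s run were spent in the simulation loop,
     * even perfect parallelization of that loop would barely help. */
    printf("p = 0.05, 8 threads: speed-up = %.2fx\n", amdahl(0.05, 8));
    /* A mostly-parallel program, by contrast, scales much better. */
    printf("p = 0.95, 8 threads: speed-up = %.2fx\n", amdahl(0.95, 8));
    return 0;
}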
When I saw the marginal improvement in execution time, I immediately thought your application is I/O bound:
- If you store intermediate steps in your simulation, then your application is likely to be I/O bound rather than CPU bound. In that case, any parallel algorithm will make almost no difference;
- Slow reading or writing could be because you are handling the data one item at a time;
- Do not read or write in small blocks when using unbuffered I/O (see the sketch after this list).
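As a rough illustration of that last point, here is a minimal sketch contrasting per-element writes with a single block write. The file names, array size, and helper names are hypothetical, not taken from your code:

#include <stdio.h>
#include <stdlib.h>

/* One fwrite per element: N tiny calls per snapshot. */
static void write_per_element(const float *out, int n, FILE *f)
{
    for (int i = 0; i < n; i++)
        fwrite(&out[i], sizeof out[i], 1, f);
}

/* One fwrite for the whole array: a single large, buffered write. */
static void write_block(const float *out, int n, FILE *f)
{
    fwrite(out, sizeof out[0], (size_t)n, f);
}

int main(void)
{
    enum { N = 1 << 20 };
    float *data = malloc(N * sizeof *data);
    if (!data) return 1;
    for (int i = 0; i < N; i++) data[i] = (float)i;

    /* Time these two with your own snapshot size: the per-element
     * version is the pattern that tends to make a run I/O bound. */
    FILE *slow = fopen("per_element.bin", "wb");
    FILE *fast = fopen("block.bin", "wb");
    if (slow) { write_per_element(data, N, slow); fclose(slow); }
    if (fast) { write_block(data, N, fast); fclose(fast); }

    free(data);
    return 0;
}

If profiling shows the run really is dominated by writing intermediate results, batching the output like this will matter far more than tuning the OpenMP loop.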
Answered By - Daniel Dearlove