parallel processing - parallelizing in OpenMP -


I have the following code that I want to parallelize using OpenMP:

for(m=0; m<r_c; m++)
{
    for(n=0; n<c_c; n++)
    {
        double value = 0.0;
        for(j=0; j<r_b; j++)
            for(k=0; k<c_b; k++)
            {
                double a;
                if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                    a = 0.0;
                else
                    a = h_a[((m-j)*c_a) + (n-k)];
                //printf("%lf\t", a);
                value += h_b[(j*c_b) + k] * a;
            }
        h_c[m*c_c + n] = value;
        //printf("%lf\t", h_c[m*c_c + n]);
    }
    //cout<<"row "<<m<<" completed"<<endl;
}

I want every thread to perform the "for j" and "for k" loops simultaneously. I tried putting #pragma omp parallel before the "for m" loop, but I am not getting the correct result. How can I do this in an optimized manner? Thanks in advance.

Depending on which loop you want to parallelize, you have three options:

#pragma omp parallel
{
    // NB: whichever option you pick, the loop indices shared between threads
    // (m, n, j, k here) must be made private - either declare them inside the
    // parallel region, as in the MCVE below, or add a private(...) clause.
    // Enable only ONE of option #1 / option #2 at a time.
#pragma omp for                             // option #1
    for(m=0; m<r_c; m++)
    {
        for(n=0; n<c_c; n++)
        {
            double value = 0.0;
#pragma omp for                             // option #2 (also needs value shared + reduction(+:value))
            for(j=0; j<r_b; j++)
                for(k=0; k<c_b; k++)
                {
                    double a;
                    if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                        a = 0.0;
                    else
                        a = h_a[((m-j)*c_a) + (n-k)];
                    //printf("%lf\t", a);
                    value += h_b[(j*c_b) + k] * a;
                }
            h_c[m*c_c + n] = value;
            //printf("%lf\t", h_c[m*c_c + n]);
        }
        //cout<<"row "<<m<<" completed"<<endl;
    }
}

//////////////////////////////////////////////////////////////////////////
// option #3: split the j loop across threads for one output element at a
// time; value is declared outside the parallel region (shared) and combined
// with a reduction, and k is made private explicitly
for(m=0; m<r_c; m++)
{
    for(n=0; n<c_c; n++)
    {
        double value = 0.0;
#pragma omp parallel
        {
#pragma omp for private(k) reduction(+:value)
            for(j=0; j<r_b; j++)
                for(k=0; k<c_b; k++)
                {
                    double a;
                    if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                        a = 0.0;
                    else
                        a = h_a[((m-j)*c_a) + (n-k)];
                    //printf("%lf\t", a);
                    value += h_b[(j*c_b) + k] * a;
                }
        }
        h_c[m*c_c + n] = value;
        //printf("%lf\t", h_c[m*c_c + n]);
    }
    //cout<<"row "<<m<<" completed"<<endl;
}

Test and profile each. You might find that option #1 is fastest if there isn't a lot of work for each thread, or you may find that with optimizations on there is no difference (or even a slowdown) when enabling OMP.
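If you settle on option #1, the parallel region and the loop directive can also be merged into a single combined construct, which sidesteps the shared-index problem because every index is declared inside the loop. A minimal sketch, assuming the arrays and dimensions are declared exactly as in the question:

// One team of threads, with the rows of h_c divided between them.
// Declaring m, n, j, k, value and a inside the loop makes them
// private to each thread automatically.
#pragma omp parallel for
for (int m = 0; m < r_c; m++)
{
    for (int n = 0; n < c_c; n++)
    {
        double value = 0.0;
        for (int j = 0; j < r_b; j++)
            for (int k = 0; k < c_b; k++)
            {
                double a;
                if ((m-j) < 0 || (n-k) < 0 || (m-j) > r_a || (n-k) > c_a)
                    a = 0.0;
                else
                    a = h_a[((m-j)*c_a) + (n-k)];
                value += h_b[(j*c_b) + k] * a;
            }
        h_c[m*c_c + n] = value;
    }
}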

Edit

I've adapted the MCVE supplied in the comments as follows:

#include <iostream>
#include <chrono>
#include <omp.h>
#include <algorithm>
#include <vector>

#define w_omp
int main(int argc, char *argv[])
{
    std::vector<double> h_a(9);
    std::generate(h_a.begin(), h_a.end(), std::rand);
    int r_b = 500;
    int c_b = r_b;
    std::vector<double> h_b(r_b * c_b);
    std::generate(h_b.begin(), h_b.end(), std::rand);
    int r_c = 500;
    int c_c = r_c;
    int r_a = 3, c_a = 3;
    std::vector<double> h_c(r_c * c_c);

    auto start = std::chrono::system_clock::now();

#ifdef w_omp
#pragma omp parallel
    {
#endif
        int m, n, j, k;
#ifdef w_omp
#pragma omp for
#endif
        for(m=0; m<r_c; m++)
        {
            for(n=0; n<c_c; n++)
            {
                double value = 0.0, a;
                for(j=0; j<r_b; j++)
                {
                    for(k=0; k<c_b; k++)
                    {
                        // note: this boundary test is copied verbatim from the
                        // question; with a 3x3 h_a the strictly safe check
                        // would be >= r_a / >= c_a
                        if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                            a = 0.0;
                        else
                            a = h_a[((m-j)*c_a) + (n-k)];
                        value += h_b[(j*c_b) + k] * a;
                    }
                }
                h_c[m*c_c + n] = value;
            }
        }
#ifdef w_omp
    }
#endif
    auto end = std::chrono::system_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << elapsed.count() << "ms"
#ifdef w_omp
        "\t with omp"
#else
        "\t without omp"
#endif
        "\n";

    return 0;
}
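One practical note: the pragmas are silently ignored unless OpenMP support is switched on in the compiler; without it you are just timing the serial loop twice. Illustrative command lines (mcve.cpp is a placeholder file name, not part of the original post):

    MSVC:      cl /O2 /openmp mcve.cpp
    GCC/Clang: g++ -O2 -fopenmp mcve.cpp -o mcve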

For reference, I'm using VS2012 (OpenMP 2.0, grrr). I'm not sure when collapse was introduced, but apparently it was after 2.0. Optimizations were /O2, compiled in Release x64.
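For completeness, this is roughly what option #1 would look like on a compiler whose OpenMP runtime does have collapse (3.0 and later); a sketch only, since it can't be tested with the toolchain above:

// collapse(2) merges the m and n loops into a single r_c*c_c iteration
// space, which helps keep all threads busy when r_c alone is small.
#pragma omp parallel for collapse(2)
for (int m = 0; m < r_c; m++)
    for (int n = 0; n < c_c; n++)
    {
        double value = 0.0;
        for (int j = 0; j < r_b; j++)
            for (int k = 0; k < c_b; k++)
                value += h_b[(j*c_b) + k] *
                         (((m-j) < 0 || (n-k) < 0 || (m-j) > r_a || (n-k) > c_a)
                              ? 0.0
                              : h_a[((m-j)*c_a) + (n-k)]);
        h_c[m*c_c + n] = value;
    }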

Benchmarks

Using the original loop sizes (7, 7, 5, 5), and therefore the original array sizes, the result was 0 ms without OMP and 1 ms with it. Verdict: compiler optimizations handle this better on their own, and the added thread overhead wasn't worth it. Also, the measurements are not reliable (too short).

Using larger loop sizes (100, 100, 100, 100), and therefore larger arrays, the results were equal at 108 ms. Verdict: still not worth the naive effort; tweaking the OMP parameters might tip the scale, but it's certainly not the x4 speedup you would hope for.

Using even larger loop sizes (500, 500, 500, 500), and therefore larger arrays, OMP started to pull ahead: about 74.3 s without OMP versus 15 s with it. Verdict: worth it. Weird, that's an x5 speedup from 4 threads and 4 cores on an i5; I'm not going to try to figure out how that happened.

Summary

As has been stated in countless answers here on SO, it's not always a good idea to parallelize every for loop you come across. Things that can screw up your desired xN speedup:

  1. Not enough work per thread to justify the overhead of creating the additional threads.
  2. The work is memory bound. That means the CPU can be running at 1 PHz and you still won't see a speedup.
  3. Memory access patterns. I'm not going to go there. Feel free to edit in the relevant info if you want it.
  4. OMP parameters. The best choice of parameters will be a result of this entire list (not including this item, to avoid recursion issues); see the sketch after this list.
  5. SIMD operations. Depending on what you're doing and how, the compiler may vectorize the operations. I have no idea whether OMP will usurp the SIMD operations, but it is possible. Check the assembly (a foreign language to me) to confirm.
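To make item 4 concrete: the knobs you usually end up turning are the team size and the schedule clause on the parallel for, and the first step is just to see what the runtime gives you by default. A small stand-alone probe, using nothing beyond the standard OpenMP API (build it with OpenMP enabled, like the MCVE above):

#include <cstdio>
#include <omp.h>

// Prints the defaults the benchmark above runs with. Tuning then means
// adding clauses such as num_threads(4) or schedule(dynamic, 16) to the
// parallel for and re-measuring.
int main()
{
    std::printf("processors available: %d\n", omp_get_num_procs());
    std::printf("max threads         : %d\n", omp_get_max_threads());

#pragma omp parallel
    {
#pragma omp single   // let a single thread report the actual team size
        std::printf("parallel team size  : %d\n", omp_get_num_threads());
    }
    return 0;
}

From there, comparing schedule(static) against schedule(dynamic, chunk) on the m loop is usually the first benchmark worth running.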
