Parallel processing - parallelizing in OpenMP
I have the following code which I want to parallelize using OpenMP:
    for(m=0; m<r_c; m++)
    {
        for(n=0; n<c_c; n++)
        {
            double value = 0.0;
            for(j=0; j<r_b; j++)
                for(k=0; k<c_b; k++)
                {
                    double a;
                    if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                        a = 0.0;
                    else
                        a = h_a[((m-j)*c_a) + (n-k)];
                    //printf("%lf\t", a);
                    value += h_b[(j*c_b) + k] * a;
                }
            h_c[m*c_c + n] = value;
            //printf("%lf\t", h_c[m*c_c + n]);
        }
        //cout<<"row "<<m<<" completed"<<endl;
    }
I want every thread to perform the "for j" and "for k" loops simultaneously. I tried putting #pragma omp parallel before the "for m" loop, but I am not getting the correct result. How can I do this in an optimized manner? Thanks in advance.
Depending on which loop you want to parallelize, you have three options:
    #pragma omp parallel
    {
        // note: the loop indices m, n, j, k must be private to each thread,
        // so declare them inside the parallel region (as in the MCVE below);
        // also, pick either option #1 or option #2, not both (nested
        // worksharing is not allowed)
        #pragma omp for // option #1
        for(m=0; m<r_c; m++)
        {
            for(n=0; n<c_c; n++)
            {
                double value = 0.0;
                #pragma omp for reduction(+:value) // option #2 (see option #3 for a layout where the reduction is valid)
                for(j=0; j<r_b; j++)
                    for(k=0; k<c_b; k++)
                    {
                        double a;
                        if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                            a = 0.0;
                        else
                            a = h_a[((m-j)*c_a) + (n-k)];
                        //printf("%lf\t", a);
                        value += h_b[(j*c_b) + k] * a;
                    }
                h_c[m*c_c + n] = value;
                //printf("%lf\t", h_c[m*c_c + n]);
            }
            //cout<<"row "<<m<<" completed"<<endl;
        }
    }

    //////////////////////////////////////////////////////////////////////////

    // option #3
    for(m=0; m<r_c; m++)
    {
        for(n=0; n<c_c; n++)
        {
            double value = 0.0; // shared accumulator for the reduction below
            #pragma omp parallel
            {
                #pragma omp for reduction(+:value)
                for(j=0; j<r_b; j++)
                    for(k=0; k<c_b; k++)
                    {
                        double a;
                        if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                            a = 0.0;
                        else
                            a = h_a[((m-j)*c_a) + (n-k)];
                        //printf("%lf\t", a);
                        value += h_b[(j*c_b) + k] * a;
                    }
            }
            h_c[m*c_c + n] = value; // store once, outside the parallel region
            //printf("%lf\t", h_c[m*c_c + n]);
        }
        //cout<<"row "<<m<<" completed"<<endl;
    }
You will have to test and profile each. You might find that option #1 is the fastest if there isn't a lot of work for each thread, or you may find that with optimizations on there is no difference (or even a slowdown) when enabling OMP.
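For what it's worth, the likely reason you weren't getting the correct result is that the loop indices were shared between threads. A minimal sketch of option #1 written as a combined construct (same arrays and dimensions as in your question), with the indices declared inside the loops so each thread automatically gets private copies:

    // minimal sketch of option #1 as a combined parallel-for;
    // m, n, j, k are declared in the loops, so they are private per thread
    #pragma omp parallel for
    for(int m = 0; m < r_c; m++)
    {
        for(int n = 0; n < c_c; n++)
        {
            double value = 0.0;
            for(int j = 0; j < r_b; j++)
                for(int k = 0; k < c_b; k++)
                {
                    double a = ((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                               ? 0.0
                               : h_a[((m-j)*c_a) + (n-k)];
                    value += h_b[(j*c_b) + k] * a;
                }
            h_c[m*c_c + n] = value;
        }
    }

Each thread gets a range of m values and writes to disjoint parts of h_c, so no synchronization is needed.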
Edit
I've adapted the MCVE supplied in the comments as follows:
    #include <iostream>
    #include <chrono>
    #include <cstdlib>
    #include <omp.h>
    #include <algorithm>
    #include <vector>

    #define w_omp

    int main(int argc, char *argv[])
    {
        std::vector<double> h_a(9);
        std::generate(h_a.begin(), h_a.end(), std::rand);

        int r_b = 500;
        int c_b = r_b;
        std::vector<double> h_b(r_b * c_b);
        std::generate(h_b.begin(), h_b.end(), std::rand);

        int r_c = 500;
        int c_c = r_c;
        int r_a = 3, c_a = 3;
        std::vector<double> h_c(r_c * c_c);

        auto start = std::chrono::system_clock::now();

    #ifdef w_omp
        #pragma omp parallel
        {
    #endif
            int m,n,j,k;
    #ifdef w_omp
            #pragma omp for
    #endif
            for(m=0; m<r_c; m++)
            {
                for(n=0; n<c_c; n++)
                {
                    double value = 0.0, a;
                    for(j=0; j<r_b; j++)
                    {
                        for(k=0; k<c_b; k++)
                        {
                            // note: the '>' checks arguably should be '>=' to stay
                            // strictly inside h_a's bounds; kept as in the question
                            if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                                a = 0.0;
                            else
                                a = h_a[((m-j)*c_a) + (n-k)];
                            value += h_b[(j*c_b) + k] * a;
                        }
                    }
                    h_c[m*c_c + n] = value;
                }
            }
    #ifdef w_omp
        }
    #endif

        auto end = std::chrono::system_clock::now();
        auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
        std::cout << elapsed.count() << "ms"
    #ifdef w_omp
            "\t with omp"
    #else
            "\t without omp"
    #endif
            "\n";

        return 0;
    }
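One thing that is easy to miss: the pragmas are silently ignored unless OpenMP is actually enabled in the build, so make sure /openmp is set in the VS project settings (or -fopenmp if you try this with GCC/Clang).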
As a reference, I'm using VS2012 (OpenMP 2.0, grrr). I'm not sure when collapse was
introduced, but apparently it was after 2.0 (OpenMP 3.0, as far as I can tell). Optimizations were /O2, compiled in Release x64.
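For anyone on a compiler with OpenMP 3.0 or later, this is roughly what collapse would look like here (a sketch, untested on my setup): it fuses the m and n loops into one iteration space, giving the scheduler r_c*c_c chunks to distribute instead of r_c.

    // sketch, assuming OpenMP 3.0+; the two outer loops must be perfectly
    // nested for collapse(2), so the per-row printing has to go
    #pragma omp parallel for collapse(2)
    for(int m = 0; m < r_c; m++)
        for(int n = 0; n < c_c; n++)
        {
            // ... same body as the sketch above (the j/k accumulation
            //     into 'value', then the store to h_c[m*c_c + n]) ...
        }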
Benchmarks
Using the original sizes of the loops (7, 7, 5, 5), and therefore arrays, the results were 0ms without OMP and 1ms with. Verdict: the optimizations alone are better, and the added overhead wasn't worth it. Also, the measurements are not reliable (too short).
Using larger sizes of the loops (100, 100, 100, 100), and therefore arrays, the results were equal at 108ms. Verdict: still not worth the naive effort, though tweaking the OMP parameters might tip the scale. Definitely not the x4 speedup I was hoping for.
Using even larger sizes of the loops (500, 500, 500, 500), and therefore arrays, OMP started to pull ahead: 74.3s without OMP, 15s with. Verdict: worth it. Which is weird: I got an x5 speedup with 4 threads and 4 cores on an i5. I'm not going to try and figure out how that happened.
Summary
As has been stated in countless answers here on SO, it's not always a good idea to parallelize every for loop you come across. Things that can screw up your desired xN speedup:
- Not enough work per thread to justify the overhead of creating the additional threads.
- The work is memory bound. This means that the CPU can be running at 1 PetaHz and you still won't see a speedup.
- Memory access patterns. I'm not going to go there. Feel free to edit in the relevant info if you want it.
- OMP parameters. The best choice of parameters will be a result of this entire list (not including this item, to avoid recursion issues).
- SIMD operations. Depending on what you're doing and how, the compiler may vectorize your operations. I have no idea if OMP will usurp the SIMD operations, but it is possible. Check the assembly (a foreign language to me) to confirm; see the sketch after this list.
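To make the last two items concrete, here is a sketch of what the OpenMP 4.0+ clauses would look like here (not available to me on VS2012, so treat it as untested): a static schedule on the threaded loop, plus an explicit vectorization request on the innermost accumulation.

    // sketch, assuming OpenMP 4.0+ (so not VS2012's 2.0): schedule(static)
    // splits the m loop into even chunks across threads, and the simd
    // pragma asks the compiler to vectorize the innermost accumulation
    #pragma omp parallel for schedule(static)
    for(int m = 0; m < r_c; m++)
    {
        for(int n = 0; n < c_c; n++)
        {
            double value = 0.0;
            for(int j = 0; j < r_b; j++)
            {
                #pragma omp simd reduction(+:value)
                for(int k = 0; k < c_b; k++)
                {
                    double a = ((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                               ? 0.0
                               : h_a[((m-j)*c_a) + (n-k)];
                    value += h_b[(j*c_b) + k] * a;
                }
            }
            h_c[m*c_c + n] = value;
        }
    }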