C++ thread overhead

I am playing around with threads in C++, in particular using them to parallelize a map operation.

Here is the code:

#include <thread>
#include <iostream>
#include <cstdlib>
#include <vector>
#include <math.h>
#include <stdio.h>
#include <time.h>  // clock(), clock_gettime()

double multByTwo(double x){
  return x*2;
}

double doJunk(double x){
  return cos(pow(sin(x*2),3));
}

template <typename T>
void map(T* data, int n, T (*ptr)(T)){
  for (int i=0; i<n; i++)
    data[i] = (*ptr)(data[i]);
}

template <typename T>
void parallelMap(T* data, int n, T (*ptr)(T)){
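  const int NUMCORES = 3;
  std::vector<std::thread> threads;
  for (int i=0; i<NUMCORES; i++){
    // 64-bit arithmetic avoids overflow in i*n; the chunk bounds cover all n
    // elements even when n is not divisible by NUMCORES
    long long begin = (long long)i*n/NUMCORES;
    long long end = (long long)(i+1)*n/NUMCORES;
    threads.push_back(std::thread(&map<T>, data + begin, int(end - begin), ptr));
  }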
  for (std::thread& t : threads)
    t.join();
}

int main()
{
  int n = 1000000000;
  double* nums = new double[n];
  for (int i=0; i<n; i++)
    nums[i] = i;

  std::cout<<"go"<<std::endl;

  clock_t c1 = clock();

  struct timespec start, finish;
  double elapsed;

  clock_gettime(CLOCK_MONOTONIC, &start);

  // also try with &doJunk
  //parallelMap(nums, n, &multByTwo);
  map(nums, n, &doJunk);

  std::cout << nums[342] << std::endl;

  clock_gettime(CLOCK_MONOTONIC, &finish);

  printf("CPU elapsed time is %f seconds\n", double(clock()-c1)/CLOCKS_PER_SEC);

  elapsed = (finish.tv_sec - start.tv_sec);
  elapsed += (finish.tv_nsec - start.tv_nsec) / 1000000000.0;

  printf("Actual elapsed time is %f seconds\n", elapsed);
}

With multByTwo, the parallel version is actually slightly slower (1.01 seconds versus 0.95 seconds of real time), while with doJunk it is faster (51 versus 136 seconds of real time). This suggests to me that

  • the parallelization works, and
  • there is a REALLY heavy overhead in creating new threads. Any thoughts on why the overhead is so high, and how I can avoid it?
+5
4 answers

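Just a guess: multByTwo does so little work per element that the run is probably limited by memory bandwidth. In that case adding cores cannot make it faster, because the data is already moving to and from RAM as fast as it can.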
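One way to sanity-check a memory-bound hypothesis is to estimate the memory traffic the in-place map generates and compare it with what the machine can sustain. The sketch below is only an illustration: effectiveBandwidthGBs is a hypothetical helper, and it assumes each element is read once and written once and that the timings in the question were taken with n = 1000000000 doubles as in the listing.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: effective memory traffic in GB/s for an in-place map
// over n elements, assuming every element is read once and written once.
double effectiveBandwidthGBs(double n, std::size_t bytesPerElement, double seconds) {
  assert(seconds > 0.0);
  double bytesMoved = 2.0 * n * static_cast<double>(bytesPerElement);
  return bytesMoved / seconds / 1e9;
}
```

Under those assumptions, effectiveBandwidthGBs(1e9, sizeof(double), 0.95) comes out at roughly 16.8 GB/s for the serial multByTwo run in the question. If that figure is close to the sustained bandwidth of the machine, extra threads have little room to help.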

+7

I tested your code on Intel Xeon systems running 64-bit Scientific Linux with g++ 4.7. Here are the results.

First, on a Xeon X7350:

multByTwo with map

CPU elapsed time is 6.690000 seconds
Actual elapsed time is 6.691940 seconds

multByTwo with parallelMap on 3 threads

CPU elapsed time is 7.330000 seconds
Actual elapsed time is 2.480294 seconds

Parallel speedup: 2.7x.

doJunk with map

CPU elapsed time is 209.250000 seconds
Actual elapsed time is 209.289025 seconds

doJunk with parallelMap on 3 threads

CPU elapsed time is 220.770000 seconds
Actual elapsed time is 73.900960 seconds

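Parallel speedup: 2.83x.

Note that the X7350 belongs to the pre-Nehalem "Tigerton" family, which uses an FSB. It is a plain SMP system without NUMA.

Next, an Intel X7550. These are Nehalem ("Beckton") Xeons with the memory controller integrated into the CPU, which makes the machine a 4-node NUMA system. Binding matters on such a system, as the timings show:

multByTwo with map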

CPU elapsed time is 4.270000 seconds
Actual elapsed time is 4.264875 seconds

multByTwo with map, bound to NUMA node 0

CPU elapsed time is 4.160000 seconds
Actual elapsed time is 4.160180 seconds

multByTwo with map, bound to NUMA node 0 and CPU 1

CPU elapsed time is 5.910000 seconds
Actual elapsed time is 5.912319 seconds

multByTwo with parallelMap on 3 threads

CPU elapsed time is 7.530000 seconds
Actual elapsed time is 3.696616 seconds

Parallel speedup: 1.13x (relative to the serial run bound to node 0). Now with binding:

multByTwo with parallelMap on 3 threads, bound to NUMA node 0

CPU elapsed time is 4.630000 seconds
Actual elapsed time is 1.548102 seconds

Parallel speedup: 2.69x, about the same as on the Tigerton.

multByTwo with parallelMap on 3 threads, bound to NUMA node 0 and CPU 1

CPU elapsed time is 5.190000 seconds
Actual elapsed time is 1.760623 seconds

Parallel speedup: 2.36x, which is 88% of the previous case.
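The speedup figures quoted with these timings can be reproduced from the "Actual elapsed time" values. The helpers below are a small sketch with hypothetical names (speedup, efficiency), not part of the original code.

```cpp
#include <cassert>

// Speedup of a parallel run over a serial baseline (wall-clock seconds).
double speedup(double serialSeconds, double parallelSeconds) {
  assert(parallelSeconds > 0.0);
  return serialSeconds / parallelSeconds;
}

// Parallel efficiency: the fraction of ideal linear speedup actually achieved.
double efficiency(double serialSeconds, double parallelSeconds, int threads) {
  assert(threads > 0);
  return speedup(serialSeconds, parallelSeconds) / threads;
}
```

For example, speedup(6.691940, 2.480294) is roughly 2.70 for the X7350 multByTwo runs, and speedup(4.160180, 1.548102) is roughly 2.69 for the node-bound X7550 runs, which corresponds to an efficiency of about 0.90 on 3 threads.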

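(I did not wait for doJunk to finish on the Nehalems, but I would expect results in line with the Tigerton case.)

One caveat about NUMA binding: if you force binding to NUMA node 0 with numactl --cpubind=0 --membind=0 ./program, memory is allocated from that node only, and if node 0 does not have enough memory, the allocation will fail.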
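Binding can also be attempted from inside the program instead of through numactl. The sketch below is Linux-specific and makes several assumptions: pinToCpu, mapRange, timesTwo and pinnedParallelMap are hypothetical names, it relies on the GNU extension pthread_setaffinity_np, and the caller is expected to know which CPU ids belong to the desired NUMA node (for instance from numactl --hardware). It only pins threads to CPUs and does not control where memory is allocated.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>  // pthread_setaffinity_np (GNU extension)
#include <sched.h>    // cpu_set_t, CPU_ZERO, CPU_SET
#include <thread>
#include <vector>

// Try to pin a std::thread to one logical CPU. Returns true on success.
bool pinToCpu(std::thread& t, int cpu) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  return pthread_setaffinity_np(t.native_handle(), sizeof(set), &set) == 0;
}

// Apply fn in place to n consecutive elements.
template <typename T>
void mapRange(T* data, std::size_t n, T (*fn)(T)) {
  for (std::size_t i = 0; i < n; ++i) data[i] = fn(data[i]);
}

double timesTwo(double x) { return x * 2; }

// One thread per entry in cpus; thread i is pinned to cpus[i] if possible.
// Returns how many threads were pinned successfully. Pinning happens just
// after the thread starts, so a very short task may finish before it applies.
template <typename T>
std::size_t pinnedParallelMap(T* data, std::size_t n, T (*fn)(T),
                              const std::vector<int>& cpus) {
  assert(!cpus.empty());
  const std::size_t parts = cpus.size();
  std::vector<std::thread> threads;
  std::size_t pinned = 0;
  for (std::size_t i = 0; i < parts; ++i) {
    std::size_t begin = i * n / parts;
    std::size_t end = (i + 1) * n / parts;  // the last chunk ends exactly at n
    threads.emplace_back(&mapRange<T>, data + begin, end - begin, fn);
    if (pinToCpu(threads.back(), cpus[i])) ++pinned;
  }
  for (std::thread& t : threads) t.join();
  return pinned;
}
```

A call such as pinnedParallelMap(nums, n, &timesTwo, {0, 1, 2}) would run three threads on CPUs 0 to 2, assuming those ids exist and are allowed for the process. The map still completes if pinning fails; the return value only reports how many threads were pinned.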


+3


Spawning new threads can be an expensive operation, depending on the platform. The easiest way to avoid this overhead is to spawn a few threads when the program starts and feed them through some kind of job queue. I believe std::async will do this for you.
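A job queue of the kind described here could look roughly like the sketch below. ThreadPool, pooledMap and timesTwo are hypothetical names introduced for illustration only. Note also that the C++ standard does not require std::async to reuse threads, so whether it avoids thread creation depends on the implementation.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal pool: workers are created once and reused for every submitted job.
class ThreadPool {
public:
  explicit ThreadPool(std::size_t workers) {
    for (std::size_t i = 0; i < workers; ++i)
      threads_.emplace_back([this] { workerLoop(); });
  }
  ~ThreadPool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    wake_.notify_all();
    for (std::thread& t : threads_) t.join();
  }
  // Queue a job; an idle worker will pick it up.
  void submit(std::function<void()> job) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      jobs_.push(std::move(job));
      ++pending_;
    }
    wake_.notify_one();
  }
  // Block until every submitted job has finished.
  void wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    idle_.wait(lock, [this] { return pending_ == 0; });
  }

private:
  void workerLoop() {
    for (;;) {
      std::function<void()> job;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        wake_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
        if (jobs_.empty()) return;  // stopping and nothing left to run
        job = std::move(jobs_.front());
        jobs_.pop();
      }
      job();
      {
        std::lock_guard<std::mutex> lock(mutex_);
        --pending_;
      }
      idle_.notify_all();
    }
  }
  std::mutex mutex_;
  std::condition_variable wake_, idle_;
  std::queue<std::function<void()>> jobs_;
  std::vector<std::thread> threads_;
  std::size_t pending_ = 0;
  bool stopping_ = false;
};

double timesTwo(double x) { return x * 2; }

// Split the array into chunks, queue one job per chunk, wait for all of them.
template <typename T>
void pooledMap(ThreadPool& pool, T* data, std::size_t n, T (*fn)(T), std::size_t chunks) {
  assert(chunks > 0);
  for (std::size_t c = 0; c < chunks; ++c) {
    std::size_t begin = c * n / chunks;
    std::size_t end = (c + 1) * n / chunks;  // the last chunk ends exactly at n
    pool.submit([=] {
      for (std::size_t i = begin; i < end; ++i) data[i] = fn(data[i]);
    });
  }
  pool.wait();
}
```

With a pool created once, for example ThreadPool pool(3), repeated calls like pooledMap(pool, nums, n, &timesTwo, 3) reuse the same three threads, so the thread creation cost is paid only at startup. This does not help if the work itself is limited by memory bandwidth.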

0