I need to solve a large problem on large graph instances. To do so, I divide the input space between threads, so that each thread runs the same function independently on its own set of inputs. When I time the program to understand its scalability, I notice that beyond 4 threads the runtime increases as I add threads. I wrote a small example to see why this happens; here it is:
#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

// Returns the elapsed time between two time points, in seconds.
template <typename T>
inline double getMs(T start, T end) {
    return double(
               std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
                   .count()) /
           1000;
}

int main() {
    std::random_device rd;
    std::mt19937 g(rd());

    unsigned int n = std::thread::hardware_concurrency();
    std::cout << n << " concurrent threads are supported.\n";

    for (size_t np = 2; np < 17; np++) {
        auto start = std::chrono::high_resolution_clock::now();
        std::cout << np << " threads: ";
        std::vector<std::thread> threads(np);

        int number_stops = 50;           // memory 39420
        int number_transfers = 1;        // memory
        int number_structures = 1;       // memory
        int number_iterations = 1000000; // time

        auto dimension = number_stops * (number_transfers + 1) * number_structures;

        // Each thread repeatedly allocates, fills, and releases a vector.
        auto paraTask = [&]() {
            for (int b = 0; b < number_iterations; b++) {
                // std::srand(unsigned(std::time(nullptr)));
                std::vector<int> v(dimension, 1586);
                // std::generate(v.begin(), v.end(), std::rand);
                v.clear();
            }
        };

        for (size_t i = 0; i < np; i++) {
            threads[i] = std::thread(paraTask);
        }

        // Join the threads
        for (auto&& thread : threads) thread.join();

        double elapsed = getMs(start, std::chrono::high_resolution_clock::now());
        printf("parallel completed: %.3f sec.\n", elapsed);
    }
    return 0;
}
Just a brief description: to emulate the actual software I'm working on, I use these variables:
int number_stops = 50; // memory 39420
int number_transfers = 1; // memory
int number_structures = 1; // memory
int number_iterations = 1000000; // time
Without going into much detail, the first three simulate memory consumption (how many vector entries I fill in each call), while the fourth simulates the amount of work per thread (number of iterations). The goal is to see what causes the increase in time when threads are added: the memory consumption, the computational time per thread, or both.
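To make the sizes concrete (a rough back-of-the-envelope calculation, assuming 4-byte int): with the setting above, each allocation holds dimension = 50 * (1 + 1) * 1 = 100 ints, i.e. about 400 bytes, and each thread allocates and frees such a vector 1,000,000 times.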
Here are the results with the setting above:
16 concurrent threads are supported.
2 threads: parallel completed: 0.995 sec.
3 threads: parallel completed: 1.017 sec.
4 threads: parallel completed: 1.028 sec.
5 threads: parallel completed: 1.081 sec.
6 threads: parallel completed: 1.131 sec.
7 threads: parallel completed: 1.122 sec.
8 threads: parallel completed: 1.216 sec.
9 threads: parallel completed: 1.445 sec.
10 threads: parallel completed: 1.603 sec.
11 threads: parallel completed: 1.596 sec.
12 threads: parallel completed: 1.626 sec.
13 threads: parallel completed: 1.634 sec.
14 threads: parallel completed: 1.611 sec.
15 threads: parallel completed: 1.648 sec.
16 threads: parallel completed: 1.688 sec.
So, as you can see, the time increases. Why is that? I also tried the other way around (fewer iterations but more memory):
int number_stops = 50; // memory 39420
int number_transfers = 100; // memory
int number_structures = 100; // memory
int number_iterations = 50; // time
and the same thing happens: the time increases:
16 concurrent threads are supported.
2 threads: parallel completed: 0.275 sec.
3 threads: parallel completed: 0.267 sec.
4 threads: parallel completed: 0.278 sec.
5 threads: parallel completed: 0.282 sec.
6 threads: parallel completed: 0.303 sec.
7 threads: parallel completed: 0.314 sec.
8 threads: parallel completed: 0.345 sec.
9 threads: parallel completed: 0.370 sec.
10 threads: parallel completed: 0.368 sec.
11 threads: parallel completed: 0.395 sec.
12 threads: parallel completed: 0.407 sec.
13 threads: parallel completed: 0.431 sec.
14 threads: parallel completed: 0.444 sec.
15 threads: parallel completed: 0.448 sec.
16 threads: parallel completed: 0.455 sec.
To give more context, I attached my computer's specifications and a CPU-Z hardware report: my CPU has 8 physical cores and 16 logical ones.
Using threads is not a golden hammer for performance. If you do not design your program well, adding threads can lead to pessimization.
Your threads do not do much: they just allocate and deallocate memory. The heap is shared state, and access to it is synchronized. This means that besides the overhead of spawning threads, you pay a synchronization penalty on every allocation.
So you have multiple overheads coming from the threads, and the only work that is truly parallel and not obstructed by other threads (initializing the vector to a default value) is too fast to show any gain from multiple threads; it is simply too cheap compared to those overheads.
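As a rough illustration of the allocator point (a minimal sketch, not your original code; hoisting the buffer out of the loop is my suggestion), each thread can reuse a single buffer instead of allocating and freeing a vector on every iteration, which removes most of the pressure on the shared heap:

// Variant of paraTask: one allocation per thread instead of one per iteration,
// so the shared allocator is hit np times rather than np * number_iterations times.
auto paraTask = [&]() {
    std::vector<int> v;        // buffer owned by this thread
    v.reserve(dimension);      // capacity allocated once and kept across iterations
    for (int b = 0; b < number_iterations; b++) {
        v.assign(dimension, 1586); // refill in place, no new heap allocation
        // ... do the real per-iteration work on v here ...
        v.clear();                 // keeps the capacity, so no deallocation either
    }
};

With a change like this, the per-iteration cost is dominated by the actual work on v, and adding threads is more likely to scale until you run out of physical cores or memory bandwidth.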