I don't know much about multi-threading and I have no idea why this is happening, so I'll just get to the point.
I'm processing an image: I divide it into 4 parts and pass each part to a separate thread (essentially, I pass the indices of the first and last pixel rows of each part). For example, if the image has 1000 rows, each thread processes 250 of them. I can go into detail about my implementation and what I'm trying to achieve in case it helps. For now I provide the code executed by the threads, in case you can spot why this is happening. I don't know if it's relevant, but in both cases (1 thread or 4 threads) the process takes around 15 ms. pfUMap and pbUMap are unordered maps.
void jacobiansThread(int start, int end, vector<float> &sJT, vector<float> &sJTJ) {
    uchar* rgbPointer;
    float* depthPointer;
    float* sdfPointer;
    float* dfdxPointer; float* dfdyPointer;

    // Camera intrinsics derived from a 45-degree FOV and a 4:3 aspect ratio.
    float fov = radians(45.0);
    float aspect = 4.0 / 3.0;
    float focal = 1 / (glm::tan(fov / 2));
    float fu = focal * cols / 2 / aspect;
    float fv = focal * rows / 2;
    float strictFu = focal / aspect;
    float strictFv = focal;

    vector<float> pixelJacobi(6, 0);

    for (int y = start; y < end; y++) {
        rgbPointer = sceneImage.ptr<uchar>(y);
        depthPointer = depthBuffer.ptr<float>(y);
        dfdxPointer = dfdx.ptr<float>(y);
        dfdyPointer = dfdy.ptr<float>(y);
        sdfPointer = sdf.ptr<float>(y);
        for (int x = roiX.x; x < roiX.y; x++) {
            float raw = sdfPointer[x];
            if (raw > 8.0) continue;

            // Smoothed Dirac delta of the SDF value.
            float dirac = (1.0f / float(CV_PI)) * (1.2f / (raw * 1.44f * raw + 1.0f));
            float deltaTerm = dirac; // previously deltaPointer[x]

            vec3 rgb(rgbPointer[x * 3], rgbPointer[x * 3 + 1], rgbPointer[x * 3 + 2]);
            vec3 bin = rgbToBin(rgb, numberOfBins);
            int indexOfColor = bin.x * numberOfBins * numberOfBins + bin.y * numberOfBins + bin.z;

            float s3 = glfwGetTime(); // leftover timing probe; s3 is never used

            float pF = pfUMap[indexOfColor];
            float pB = pbUMap[indexOfColor];
            float heavisideTerm = HEAVISIDE(raw);
            float denominator = (heavisideTerm * pF + (1 - heavisideTerm) * pB) + 0.000001;
            float commonFirstTerm = -(pF - pB) / denominator * deltaTerm;
            if (pF == pB) continue;

            vec3 pixel(x, y, depthPointer[x]);
            float dfdxTerm = dfdxPointer[x];
            float dfdyTerm = -dfdyPointer[x];
            if (pixel.z == 1) {
                // No depth at this pixel: fall back to the closest contour point.
                cv::Point c = findClosestContourPoint(cv::Point(x, y), dfdxTerm, -dfdyTerm, abs(raw));
                if (c.x == -1) continue;
                pixel = vec3(c.x, c.y, depthBuffer.at<float>(cv::Point(c.x, c.y)));
            }

            vec3 point3D = pixel;
            pixelToViewFast(point3D, cols, rows, strictFu, strictFv);
            float Xc = point3D.x; float Xc2 = Xc * Xc;
            float Yc = point3D.y; float Yc2 = Yc * Yc;
            float Zc = point3D.z; float Zc2 = Zc * Zc;

            // Per-pixel Jacobian with respect to the 6 pose parameters.
            pixelJacobi[0] = dfdyTerm * ((fv * Yc2) / Zc2 + fv) + (dfdxTerm * fu * Xc * Yc) / Zc2;
            pixelJacobi[1] = -dfdxTerm * ((fu * Xc2) / Zc2 + fu) - (dfdyTerm * fv * Xc * Yc) / Zc2;
            pixelJacobi[2] = -(dfdyTerm * fv * Xc) / Zc + (dfdxTerm * fu * Yc) / Zc;
            pixelJacobi[3] = -(dfdxTerm * fu) / Zc;
            pixelJacobi[4] = -(dfdyTerm * fv) / Zc;
            pixelJacobi[5] = (dfdyTerm * fv * Yc) / Zc2 + (dfdxTerm * fu * Xc) / Zc2;

            float weightingTerm = -1.0 / log(denominator);
            for (int i = 0; i < 6; i++) {
                pixelJacobi[i] *= commonFirstTerm;
                sJT[i] += pixelJacobi[i];
            }
            // Accumulate the upper triangle of J^T * J.
            for (int i = 0; i < 6; i++) {
                for (int j = i; j < 6; j++) {
                    sJTJ[i * 6 + j] += weightingTerm * pixelJacobi[i] * pixelJacobi[j];
                }
            }
        }
    }
}
This is the part where I launch the threads:
vector<std::thread> myThreads;
float step = (roiY.y - roiY.x) / numberOfThreads;
vector<vector<float>> tsJT(numberOfThreads, vector<float>(6, 0));
vector<vector<float>> tsJTJ(numberOfThreads, vector<float>(36, 0));
for (int i = 0; i < numberOfThreads; i++) {
    int start = roiY.x + i * step;
    int end = start + step;
    // Give the last thread any remainder rows, so truncation of step
    // cannot leave rows at the bottom of the ROI unprocessed.
    if (i == numberOfThreads - 1) end = roiY.y;
    myThreads.push_back(std::thread(&pwp3dV2::jacobiansThread, this, start, end, std::ref(tsJT[i]), std::ref(tsJTJ[i])));
}
vector<float> sJT(6, 0);
vector<float> sJTJ(36, 0);
for (int i = 0; i < numberOfThreads; i++) myThreads[i].join();
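After the join, the per-thread partial sums in tsJT and tsJTJ presumably still get folded into sJT and sJTJ; a minimal sketch of that reduction, assuming the buffers declared above:

for (int t = 0; t < numberOfThreads; t++) {
    for (int i = 0; i < 6; i++) sJT[i] += tsJT[t][i];
    for (int k = 0; k < 36; k++) sJTJ[k] += tsJTJ[t][k];
}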
Other Notes
To measure time I used glfwGetTime() right before and right after the second code snippet. The measurements vary, but the average is about 15 ms, as I mentioned, for both implementations.
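As an aside, std::chrono gives the same measurement without depending on GLFW; a minimal sketch, where runJacobians() is a hypothetical stand-in for the second code snippet:

#include <chrono>

auto t0 = std::chrono::steady_clock::now();
runJacobians(); // hypothetical call wrapping the thread-launching snippet
auto t1 = std::chrono::steady_clock::now();
double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();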
Starting a thread has significant overhead, which might not be worth it when you only have about 15 ms worth of work.
The common solution is to keep threads running in the background and send them data when you need them, instead of constructing a new std::thread every time there is work to do.
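A minimal sketch of that pattern, using a condition variable to wake persistent workers (the WorkerPool name and the structure are illustrative, not from the question):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Persistent worker pool: threads are created once and reused for every frame.
class WorkerPool {
public:
    explicit WorkerPool(unsigned n) {
        for (unsigned i = 0; i < n; i++)
            workers.emplace_back([this] { run(); });
    }
    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(m);
            stopping = true;
        }
        cv.notify_all();
        for (auto &w : workers) w.join();
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m);
            tasks.push(std::move(task));
        }
        cv.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this] { return stopping || !tasks.empty(); });
                if (stopping && tasks.empty()) return;
                task = std::move(tasks.front());
                tasks.pop();
            }
            task(); // run the work outside the lock
        }
    }
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool stopping = false;
};

Each frame you would then submit one task per row range (each writing to its own tsJT[i]/tsJTJ[i] slot, so no locking is needed in the hot loop) and wait for completion, for example by counting finished tasks with an atomic or by wrapping each task in a std::packaged_task and waiting on its future. The thread-creation cost is paid only once, at startup.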